Advances in Stream Analytics: Apache Beam and Google Cloud Dataflow deep-dive
Talk video
Talk presentation
Apache Beam is one of the fastest-growing stream analytics and batch data processing frameworks. The Apache Software Foundation recognized it in 2018 as one of the top 3 Apache projects by commits and dev@ list activity. Users regularly run distributed data processing jobs on Beam ranging from just a few CPU cores to tens of thousands of CPU cores.
The key to Beam's high performance and scalability is a set of advances in distributed data processing developed internally at Google to handle exabytes of data for its Gmail, YouTube, and Ads products. After the initial publication of the MapReduce paper and its successful implementation as Apache Hadoop, Google continued innovating on large-scale data processing and later published the MillWheel paper, made the technology available as the Cloud Dataflow managed service on Google Cloud, and open-sourced its SDK as Apache Beam.
In this session, Sergei Sokolenko, the Google product manager for Cloud Dataflow, will share the implementation details of many of the unique features available in Apache Beam and Cloud Dataflow, including:
- autoscaling of resources based on data inputs;
- separating compute and state storage for better scaling of resources;
- simultaneous grouping and joining of hundreds of terabytes of data in a hybrid in-memory/on-disk file system;
- dynamic work rebalancing, which moves work items away from overutilized worker nodes; and many others (see the configuration sketch after this list).
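For readers who want to see how these capabilities surface to users, the following is a minimal sketch, not taken from the talk, of the publicly documented Dataflow pipeline options that turn on throughput-based autoscaling and the Streaming Engine (the compute/state separation mentioned above). The project id, region, bucket, and worker count are placeholder assumptions.

```java
// Minimal sketch (placeholder values noted below): submitting a Beam pipeline to
// Cloud Dataflow with autoscaling and Streaming Engine enabled. Requires the Beam
// Java SDK and beam-runners-google-cloud-dataflow-java on the classpath.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsSketch {
  public static void main(String[] args) {
    String[] flags = {
      "--runner=DataflowRunner",
      "--project=my-gcp-project",                // placeholder project id
      "--region=us-central1",                    // placeholder region
      "--tempLocation=gs://my-bucket/tmp",       // placeholder staging bucket
      "--autoscalingAlgorithm=THROUGHPUT_BASED", // scale workers with the input backlog
      "--maxNumWorkers=50",                      // upper bound for autoscaling
      "--enableStreamingEngine"                  // keep pipeline state in the service rather than on workers
    };
    PipelineOptions options = PipelineOptionsFactory.fromArgs(flags).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply transforms here, then submit the job:
    // pipeline.run();
  }
}
```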
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Sergei will also touch upon some of the unique properties of Apache Beam, such as:
- its compatibility with Apache Flink, Spark, Samza, and a half-dozen other popular data processing frameworks, allowing users to run the same Beam code on any cluster environment they choose;
- one of the easiest ways to get started with stream analytics: writing SQL against streams of data, joining them with tables, and applying time-based windowing functions (a brief example follows this list).
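To make the Beam SQL point concrete, here is a minimal sketch, not taken from the talk, that counts events per user in one-minute tumbling windows with SqlTransform. The schema, field names, and in-memory sample rows are illustrative assumptions; a real pipeline would read from a streaming source such as Pub/Sub.

```java
// Minimal Beam SQL sketch (illustrative schema and data): group rows into
// one-minute tumbling windows and count events per user.
// Assumes beam-sdks-java-core, beam-sdks-java-extensions-sql, and a runner
// (the DirectRunner is assumed here).
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class BeamSqlWindowSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical "clicks" schema: a user id and an event timestamp.
    Schema schema =
        Schema.builder().addStringField("user_id").addDateTimeField("event_ts").build();

    PCollection<Row> clicks =
        p.apply(
            Create.of(
                    Row.withSchema(schema)
                        .addValues("alice", new DateTime(2024, 6, 1, 12, 0, 10, DateTimeZone.UTC))
                        .build(),
                    Row.withSchema(schema)
                        .addValues("bob", new DateTime(2024, 6, 1, 12, 0, 40, DateTimeZone.UTC))
                        .build())
                .withRowSchema(schema));

    // Count clicks per user in one-minute tumbling windows, expressed in SQL.
    PCollection<Row> counts =
        clicks.apply(
            SqlTransform.query(
                "SELECT user_id, COUNT(*) AS click_count "
                    + "FROM PCOLLECTION "
                    + "GROUP BY user_id, TUMBLE(event_ts, INTERVAL '1' MINUTE)"));

    p.run().waitUntilFinish();
  }
}
```

The same code runs unchanged on Dataflow, Flink, Spark, or another supported runner by switching the --runner pipeline option.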
- Senior Product Manager for Cloud Dataflow, a unified streaming and batch data processing service on Google Cloud.
- An active participant in the Apache Beam community who also volunteers for Razom for Ukraine.
- He is currently based in Seattle, United States, and enjoys hiking in the gorgeous Pacific Northwest with his wife and two sons.
- Twitter, GitHub