foxwendy

Reputation: 2929

Beam and Dataflow - How to do GroupByKey and sort faster?

I have over 100 GB of bounded data to process. The goal is to maximize throughput. Basically, I need to segment the data into groups, sort each group, and then do some ParDo work. The following code snippet shows how I assign the session windows and then do the GroupByKey and sort. I found GroupByKey to be the bottleneck. From reading this blog, I understand that partial combining can significantly reduce the data shuffle. However, in my case, because I sort after the GroupByKey, I guess the shuffle is going to be over 100 GB anyway. So the questions are:

  1. Are there other ways to increase GroupByKey throughput in my case?
  2. One workaround I can think of is to compose a query in BigQuery that does roughly the same thing (i.e. segment by the data's time gaps, group, and sort), and leave only the remaining ParDos to Dataflow, so that no GroupByKey is needed. But session windows are so convenient and save me so much code that I'd really like to avoid doing it "manually" by writing a query in GBQ.

    PCollection<KV<String, TableRow>> sessionWindowedPairs = rowsKeyedByHardwareId
        .apply("Assign Rows Keyed by HardwareId into Session Windows",
            Window.into(Sessions.withGapDuration(Duration.standardSeconds(200))));

    PCollection<KV<String, List<TableRow>>> sortedGPSPerDevice = sessionWindowedPairs
        .apply("Group By HardwareId (and Window)", GroupByKey.create())
        .apply("Sort GPSs of Each Group By DateTime", ParDo.of(new SortTableRowDoFn()));

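SortTableRowDoFn is essentially an in-memory sort of each group. A minimal sketch, assuming the rows carry an ISO-8601 "datetime" string field and each group fits in a worker's memory:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the per-group sort. The entire group is materialized and
    // sorted in memory, which is part of why GroupByKey + sort is heavy here.
    class SortTableRowDoFn
        extends DoFn<KV<String, Iterable<TableRow>>, KV<String, List<TableRow>>> {

      @ProcessElement
      public void processElement(ProcessContext c) {
        List<TableRow> rows = new ArrayList<>();
        for (TableRow row : c.element().getValue()) {
          rows.add(row);
        }
        // Assumes "datetime" is an ISO-8601 string, so lexicographic order == time order.
        rows.sort((a, b) ->
            ((String) a.get("datetime")).compareTo((String) b.get("datetime")));
        c.output(KV.of(c.element().getKey(), rows));
      }
    }
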
Upvotes: 1

Views: 4001

Answers (2)

Charles Chen

Reputation: 346

It is not clear from the question details what the bottleneck is, or whether the pipeline is operating more slowly than expected. For the Dataflow team to look at this and give specific guidance, at least a job ID, and ideally the pipeline code, is needed. That said, we can give the following general advice:

See also the previous thread: Slow throughput after a groupBy.
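One general option for large per-group sorts is Beam's sorter extension (org.apache.beam.sdk.extensions.sorter), which spills to disk instead of sorting each group inside a DoFn. A rough sketch, assuming each row is re-keyed by an ISO-8601 "datetime" string as the secondary sort key (names reuse those from the question):

    import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
    import org.apache.beam.sdk.extensions.sorter.SortValues;

    // Re-key each value as KV<datetime, row> so the sorter can use the datetime
    // string as the secondary key. SortValues orders by the key's encoded bytes,
    // so ISO-8601 strings come out in chronological order.
    PCollection<KV<String, KV<String, TableRow>>> rekeyed = rowsKeyedByHardwareId
        .apply("Re-key Values By DateTime",
            ParDo.of(new DoFn<KV<String, TableRow>, KV<String, KV<String, TableRow>>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                TableRow row = c.element().getValue();
                // Assumes a "datetime" field usable as the secondary sort key.
                c.output(KV.of(c.element().getKey(),
                    KV.of((String) row.get("datetime"), row)));
              }
            }));

    // Group, then sort each group's values with an external (disk-backed) sorter
    // instead of holding the whole group in memory.
    PCollection<KV<String, Iterable<KV<String, TableRow>>>> groupedAndSorted = rekeyed
        .apply("Group By HardwareId", GroupByKey.create())
        .apply("Sort Each Group By DateTime",
            SortValues.create(BufferedExternalSorter.options()));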

Upvotes: 2

lucaboq

Reputation: 66

I too have been trying to process hundreds of GBs of bounded data with Dataflow and have had no success. Long story short, a group-by-key was the bottleneck there as well. In the end I came to the same conclusion you reached in 2. and used BigQuery to do some high-level aggregation to reduce the data set to the order of GBs, after which Dataflow is much more performant.

I am not sure of the format of your data set, but this blog post also discusses how to handle hot keys using combiners, which may suit your needs.
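
For illustration, hot-key fanout with a combiner in the Beam Java SDK looks roughly like this; the count is just a stand-in aggregation, the fanout factor is arbitrary, and the names reuse those from the question:

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Count;

    // Sketch: a combinable aggregation with hot-key fanout. Beam spreads each hot key
    // over intermediate sub-keys, pre-combines them in parallel, then merges the
    // partial results, which takes pressure off the shuffle for skewed keys.
    PCollection<KV<String, Long>> rowsPerDevice = rowsKeyedByHardwareId
        .apply("Count Rows Per Device",
            Combine.<String, TableRow, Long>perKey(Count.combineFn())
                .withHotKeyFanout(16));  // fanout factor chosen arbitrarily

Note this only helps when the per-key result can be built up with an associative combine; if you need the full sorted list of rows per key, the rows themselves still have to be shuffled.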

I'd love for Dataflow to be able to handle larger bounded data sets, as I too would rather not maintain both a BigQuery query and Dataflow logic, but I have not come across a way to do it.

Upvotes: 1
