foxwendy

Reputation: 2929

Beam and Dataflow - How to do GroupByKey and sort faster?

I have over 100 GB of bounded data to process. The goal is to maximize throughput. Basically, I need to segment the data into groups, sort each group, and then do some ParDo work. The following code snippet shows how I assign the session windows and then do the GroupByKey and sort. I found GroupByKey to be the bottleneck. From reading this blog, I understand that partial combining can significantly reduce the data shuffle. However, in my case, because I sort after the GroupByKey, I guess the shuffle is going to be over 100 GB anyway. So the questions are:

  1. Are there other ways to increase GroupByKey throughput in my case?
  2. One workaround I can think of is to compose a query in BigQuery that does roughly the same thing (i.e. segment by the data's time gaps, group, and sort), and leave only the remaining ParDos to Dataflow, so that no GroupByKey is needed. But session windows are so convenient and save me so much code that I'd really like to avoid doing it "manually" by writing a query in GBQ.

    PCollection<KV<String, TableRow>> sessionWindowedPairs = rowsKeyedByHardwareId
        .apply("Assign Rows Keyed by HardwareId into Session Windows",
            Window.into(Sessions.withGapDuration(Duration.standardSeconds(200))));

    PCollection<KV<String, List<TableRow>>> sortedGPSPerDevice = sessionWindowedPairs
        .apply("Group By HardwareId (and Window)", GroupByKey.create())
        .apply("Sort GPSs of Each Group By DateTime", ParDo.of(new SortTableRowDoFn()));

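SortTableRowDoFn is essentially an in-memory sort of each group. A minimal sketch, assuming the rows carry an ISO-8601 "datetime" string field and each group fits in a worker's memory:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the per-group sort. The entire group is materialized and
    // sorted in memory, which is part of why GroupByKey + sort is heavy here.
    class SortTableRowDoFn
        extends DoFn<KV<String, Iterable<TableRow>>, KV<String, List<TableRow>>> {

      @ProcessElement
      public void processElement(ProcessContext c) {
        List<TableRow> rows = new ArrayList<>();
        for (TableRow row : c.element().getValue()) {
          rows.add(row);
        }
        // Assumes "datetime" is an ISO-8601 string, so lexicographic order == time order.
        rows.sort((a, b) ->
            ((String) a.get("datetime")).compareTo((String) b.get("datetime")));
        c.output(KV.of(c.element().getKey(), rows));
      }
    }
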
Upvotes: 1

Views: 4001

Answers (2)

Charles Chen

Reputation: 346

It is not clear from the question details what the bottleneck is, or whether the pipeline is operating more slowly than expected. For the Dataflow team to look at this and give specific guidance, at least a job ID, and ideally the pipeline code, is needed. That said, we can give the following general advice:

See also the previous thread: Slow throughput after a groupBy.
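One general option for large per-group sorts is Beam's sorter extension (org.apache.beam.sdk.extensions.sorter), which spills to disk instead of sorting each group inside a DoFn. A rough sketch, assuming each row is re-keyed by an ISO-8601 "datetime" string as the secondary sort key (names reuse those from the question):

    import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
    import org.apache.beam.sdk.extensions.sorter.SortValues;

    // Re-key each value as KV<datetime, row> so the sorter can use the datetime
    // string as the secondary key. SortValues orders by the key's encoded bytes,
    // so ISO-8601 strings come out in chronological order.
    PCollection<KV<String, KV<String, TableRow>>> rekeyed = rowsKeyedByHardwareId
        .apply("Re-key Values By DateTime",
            ParDo.of(new DoFn<KV<String, TableRow>, KV<String, KV<String, TableRow>>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                TableRow row = c.element().getValue();
                // Assumes a "datetime" field usable as the secondary sort key.
                c.output(KV.of(c.element().getKey(),
                    KV.of((String) row.get("datetime"), row)));
              }
            }));

    // Group, then sort each group's values with an external (disk-backed) sorter
    // instead of holding the whole group in memory.
    PCollection<KV<String, Iterable<KV<String, TableRow>>>> groupedAndSorted = rekeyed
        .apply("Group By HardwareId", GroupByKey.create())
        .apply("Sort Each Group By DateTime",
            SortValues.create(BufferedExternalSorter.options()));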

Upvotes: 2

lucaboq

Reputation: 66

I too have been trying to process hundreds of GBs of bounded data with Dataflow and have had no success. Long story short, a group-by-key was the bottleneck there as well. In the end I came to the same conclusion you reached in 2. and used BigQuery to do some high-level aggregation to reduce the data set to the order of GBs, after which Dataflow is much more performant.

I am not sure of the format of your data set, but this blog post also discusses how to handle hot keys using combiners, which may suit your needs.
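
For illustration, hot-key fanout with a combiner in the Beam Java SDK looks roughly like this; the count is just a stand-in aggregation, the fanout factor is arbitrary, and the names reuse those from the question:

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Count;

    // Sketch: a combinable aggregation with hot-key fanout. Beam spreads each hot key
    // over intermediate sub-keys, pre-combines them in parallel, then merges the
    // partial results, which takes pressure off the shuffle for skewed keys.
    PCollection<KV<String, Long>> rowsPerDevice = rowsKeyedByHardwareId
        .apply("Count Rows Per Device",
            Combine.<String, TableRow, Long>perKey(Count.combineFn())
                .withHotKeyFanout(16));  // fanout factor chosen arbitrarily

Note this only helps when the per-key result can be built up with an associative combine; if you need the full sorted list of rows per key, the rows themselves still have to be shuffled.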

I'd love for Dataflow to be able to handle larger bounded data sets, as I too would rather not maintain both a BigQuery query and Dataflow logic, but I have not come across a way to do it.

Upvotes: 1
