user3777228

Reputation: 159

Can Google dataflow GroupByKey handle hot keys?

The input is a PCollection<KV<String, String>>. I have to write one file per key, with each value in that key's group written as a line. To group by key, I have two options:
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey.withHotKeyFanout --> PCollection in which the String value is the accumulation of the Strings from all pairs (Combine.CombineFn<String, List<String>, CustomStringObJ>)

I can have a million records per key. The keyed data is reduced using windows and triggers, but there can still be thousands of entries per key. I worry that the maximum size of a String will cause problems if Combine.perKey.withHotKeyFanout is used to build a CustomStringObJ, which has a List<String> member, to be written to the file.

If we use GroupByKey, how do we handle hot keys?

Upvotes: 2

Views: 1312

Answers (1)

Kenn Knowles

Reputation: 6033

You should use the GroupByKey approach rather than using Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key, and in the output KV<K, Iterable<V>> the iterable of values is a lazy, streamed view of the elements shuffled to that key. No in-memory collection is actually constructed, so this is just as good as routing each element to the worker that owns each file and writing it directly.
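To illustrate the streaming point, here is a minimal sketch in plain Java of what the per-key write could look like. It is not Beam code: GroupWriter and writeGroup are hypothetical names, and the generated Iterable in main only stands in for the lazy view that GroupByKey hands you. In a real pipeline, a loop like this would presumably sit inside the DoFn that receives each KV<String, Iterable<String>>.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Beam: models the work done for one
// grouped key, i.e. one KV<String, Iterable<String>> element.
final class GroupWriter {

    // Writes each value as one line, holding only the current element in
    // memory. Returns the number of lines written.
    static long writeGroup(Iterable<String> values, Writer out) throws IOException {
        long count = 0;
        for (String line : values) {
            out.write(line);
            out.write('\n');
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption for illustration: an Iterable that generates elements
        // on demand, standing in for the lazy view over shuffled values.
        Iterable<String> lazy = () -> new Iterator<String>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < 3;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return "record-" + i++;
            }
        };

        StringWriter sink = new StringWriter();
        long written = writeGroup(lazy, sink);
        System.out.print(sink);
        System.out.println(written + " lines written");
    }
}
```

The thing to avoid is copying the values into a List or one large String before writing, since that would reintroduce the per-key memory limit you are worried about.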

Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but most useful for aggregations where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.

Upvotes: 2
