Reputation: 406
I am using Google Dataflow via the Java SDK. A GroupByKey transform returns an Iterable as the value of each KV in the resulting PCollection. Suppose we run a ParDo over the KV results of a GroupByKey. What is the nature of that Iterable? Does it hold a fully pre-populated list, so that an Iterable of 1000 Integers consumes roughly 1000 * sizeof(Integer) of memory on that node? Or is it evaluated lazily (something like generators in Python), so that memory consumption stays minimal no matter how large the Iterable is?
Upvotes: 0
Views: 576
Reputation: 17913
These Iterables are lazy and, at least when running on the Dataflow runner, they are allowed to hold more data per key than will fit in memory. Values for a key are loaded into memory lazily as you iterate over the Iterable.
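To illustrate what "lazy" means for the consumer, here is a minimal plain-Java sketch. It does not use the Dataflow SDK and is not how the runner is implemented: `LazyIterableSketch`, `lazyValues`, and `sum` are hypothetical names, and the generated integers only stand in for the grouped values of one key. The point is that a single pass over an Iterable needs to keep only the current element and a running result, not the whole group.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class LazyIterableSketch {

    // Hypothetical stand-in for the grouped values of one key:
    // each element is produced only when the consumer asks for it.
    static Iterable<Integer> lazyValues(final int count) {
        return () -> new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    // Single-pass aggregation: only the running total is retained,
    // so memory use does not grow with the number of values.
    static long sum(Iterable<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sums 0..999 without ever building a 1000-element list.
        System.out.println(sum(lazyValues(1000))); // prints 499500
    }
}
```

By contrast, copying the values into a List inside the ParDo would materialize every element for that key at once, which gives up the benefit of lazy loading.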
Upvotes: 3