Reputation: 1329
We have a custom combine function (on Beam SDK 2.0) in which millions of objects get accumulated but do NOT necessarily get reduced — that is, they sometimes get added to a List, so that eventually the List can get quite large (hundreds of megabytes, even gigabytes).
To minimize the problem of having to "pass around" these objects between nodes (during merging of accumulators), we've created a SINGLE giant node (64 cores, tons of RAM).
So, in "theory", dataflow does not need to serialize the List object (and any of these big objects in the List) even during "merge accumulator" operations, since all the objects are on the same node. But, does dataflow still serialize even if all the objects of interest are on the same node or is it smart enough to know that an object is on the same node vs separate nodes?
Ideally, when objects are on same node, we can just pass around references to the objects (rather than serializing/deserializing the contents of these objects, which can be very very large.) (I understand, of course, than when dealing with multiple nodes, there's no choice but to serialize/deserialize since the data has to be passed around somehow; but within a node, is beam sdk 2.0 smart enough to not serialize/deserialize during these combine functions, group by's etc.?)
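For concreteness, here is a minimal sketch of the kind of list-accumulating CombineFn described above (the class name is illustrative, and String stands in for the real element type, which is much larger in practice):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

// Illustrative sketch only: a CombineFn whose accumulator is a growing List.
public class ListAccumulatingFn
    extends Combine.CombineFn<String, List<String>, List<String>> {

  @Override
  public List<String> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<String> addInput(List<String> accumulator, String input) {
    // Inputs are appended rather than reduced, so the accumulator keeps growing.
    accumulator.add(input);
    return accumulator;
  }

  @Override
  public List<String> mergeAccumulators(Iterable<List<String>> accumulators) {
    // This is the step in question: the lists being merged may already hold
    // hundreds of megabytes of elements each.
    List<String> merged = new ArrayList<>();
    for (List<String> accumulator : accumulators) {
      merged.addAll(accumulator);
    }
    return merged;
  }

  @Override
  public List<String> extractOutput(List<String> accumulator) {
    return accumulator;
  }
}
```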
Upvotes: 2
Views: 598
Reputation: 6023
The Dataflow service aggressively optimizes your pipeline to avoid needless serialization. The optimization you are interested in is fusion, described in the Dataflow documentation. When data moves through a fused "stage" (a sequence of low-level instructions roughly corresponding to steps in your input pipeline), it is not serialized and deserialized.
However, if your CombineFn builds a list, and that list grows large, you should try to rephrase your pipeline to use a raw GroupByKey. Another important optimization is "combiner lifting" or "mapper-side combine", where your CombineFn is applied per-key locally prior to shuffling your data between machines, based on the assumption that the accumulator will be smaller than just a list of elements. So the whole list will be serialized, shuffled, and deserialized prior to completing the Combine transform. If, instead, you use a GroupByKey directly, your elements would be streamed much more efficiently, without serializing an entire list.
I should note that Beam's other runners also perform standard fusion and other optimizations. These generally come from functional programming work in the late 80s / early 90s and were applied to distributed data processing in FlumeJava, circa 2010, so it is a baseline expectation now.
Upvotes: 0