Sam McVeety

Reputation: 3214

Use of GroupIntoBatches on Bounded Source

I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and want to have a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any concerns around efficiency in batch mode that I should be aware of?

Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?

Upvotes: 1

Views: 665

Answers (2)

robertwb

Reputation: 5104

GroupIntoBatches will work. However, if you're running a batch pipeline and don't have a natural key on which to group, consider using BatchElements instead: making up a random key often results in batches that are too small or parallelism that is too low, and can interact poorly with liquid sharding. BatchElements batches without keys and can be configured with either a fixed or a dynamic batch size.

Upvotes: 1

Tlaquetzal

Reputation: 2850

I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform uses KV pairs and the parallelism that you want to achieve will be limited by the number of keys. I suggest taking a look at this answer.

Regarding the batch size: yes, batches may be smaller than the configured size if there are not enough elements left. Take a look at this example from the Beam Python documentation:

GroupIntoBatches

Upvotes: 1
