Reputation: 3214
I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and I want a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any efficiency concerns in batch mode that I should be aware of?
Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?
Upvotes: 1
Views: 665
Reputation: 5104
GroupIntoBatches will work. However, if you're running a batch pipeline and don't have a natural key on which to group, consider using BatchElements instead: it can batch without keys and can be configured with either a fixed or a dynamic batch size. Making up a random key often results in batches that are too small, or in parallelism that is too small, and it can interact poorly with liquid sharding.
Upvotes: 1
Reputation: 2850
I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform operates on KV pairs, and the parallelism you can achieve is limited by the number of distinct keys. I suggest taking a look at this answer.
Regarding the batch size: yes, a batch may be smaller than the configured size if there are not enough elements left to fill it. Take a look at this example from the Beam Python documentation:
Upvotes: 1