Sam McVeety

Reputation: 3214

Use of GroupIntoBatches on Bounded Source

I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and want to have a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any concerns around efficiency in batch mode that I should be aware of?

Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?

Upvotes: 1

Views: 665

Answers (2)

robertwb

Reputation: 5104

GroupIntoBatches will work. However, if you're running a batch pipeline and don't have a natural key on which to group, consider using BatchElements instead: making up a random key often results in batches that are too small or parallelism that is too low, and can interact poorly with liquid sharding. BatchElements batches without keys and can be configured with either a fixed or a dynamic batch size.

Upvotes: 1

Tlaquetzal

Reputation: 2850

I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform uses KV pairs and the parallelism that you want to achieve will be limited by the number of keys. I suggest taking a look at this answer.

Regarding the batch size: yes, batches may be smaller than the configured size if there are not enough elements left. Take a look at this example from the Beam Python documentation:

GroupIntoBatches

Upvotes: 1
