Anthony Naddeo
Anthony Naddeo

Reputation: 2751

What determines batch size in beam/dataflow?

I have a pipeline that uses the batch variant of DoFn (which the docs weren't very helpful for). It looks like this

class MyFn(beam.DoFn):

    def process_batch(self, batch: List[MyType]) -> Iterator[List[MyType]]:
        # process batches
        results = []
        for foo in batch:
            # do work, add to results

        yield results

I've got some logging setup to shows me that my process_batch method is operating on 4096 items consistently. Does anyone know why its 4096, or how to make it higher or lower?

Upvotes: 1

Views: 729

Answers (1)

Currently the BATCH SIZE is hardcoded to 4096 in the batched DoFns. You can raise a feature request in the apache beam community to make it configurable.

Upvotes: 2

Related Questions