I have a stream of IDs in Apache Flink. I would like to batch them into sets of 500 and for each batch call an external service that will give me additional data for each ID. Then I want to forward each ID with the additional data further downstream. I'm using batching here for performance reasons because 1 request with 500 IDs is much faster than 500 requests with 1 ID.
I tried implementing this using windows, but I'm either getting tiny batches or no batches at all. In BATCH runtime execution mode I'm also losing the last remaining IDs.
Ideally I would like to:
I'm a bit lost with the DataSet API. Which functions should I use, and how should I structure the program?
With the (recommended) DataStream API, and the goal of having a scalable, reliable workflow, one approach is the following:

1. Convert each incoming record to a `Tuple2<key, record>`. The key would be an Integer hash calculated from one or more stable fields in the incoming record. By "stable" I mean they wouldn't change if you re-ran the workflow on the same data, so it wouldn't be (say) a field where you put the ingest time.
2. Key the stream by `Tuple2.f0` (the first field).
3. Batch the keyed records in a `KeyedProcessFunction`. This would save incoming records in `ListState` (and also register a timer set to `MAX_WATERMARK`). When you had 500 records, or the timer fired (which would happen when all of the incoming data had been received), you'd output a record containing the batch of incoming records.
4. Enrich the batches in a `RichAsyncFunction`, where you call the remote service with the batch of records, and use the response to enrich (and then emit) the records.