Reputation: 11
I am working with Flink using data from a Kafka topic that has multiple partitions. Is it possible to have a window on each parallel sub-task/partition without having to use keyBy
(as I want to avoid the shuffle). Based on the documentation, I can only choose between keyed windows (which requires a shuffle) or global windows (which reduces parallelism to 1).
The motivation is that I want to use a CountWindow
to batch the messages with a custom trigger that also fires after a set amount of processing time. So per Kafka partition, I want to batch N records together or wait X amount of processing time before sending the batch downstream.
Thanks!
Upvotes: 1
Views: 283
Reputation: 43707
There's no good way to do that.
One workaround would be to implement the batching and timeout logic in a custom sink. You'd want to implement the CheckpointedFunction
interface to make your solution fault tolerant, and you could use the Sink.ProcessingTimeService.ProcessingTimeCallback
interface for the timeouts.
UPDATE:
Just thought of another solution, similar to the one in your comment below. You could implement a custom source that sends a periodic heartbeat, and broadcast that to a BroadcastProcessFunction.
Upvotes: 1