Dennis Jaheruddin
Dennis Jaheruddin

Reputation: 21561

Enforcing well balanced parallelism in a unkeyed Flink stream

Based on my understanding of Flink, it introduces parallelism based on keys (keygroups). However, suppose one had a massive unkeyed stream and would like the work to be done in parallel, what would be the best way to achieve this?

If the stream has some fields, one might think about keying by one of the fields arbirtrarily, however this does not guarantee that the workload will be balanced properly. For instance because one value in that field may occur in 90% of the messages. Hence my question:

How to enforce well balanced parallelism in Flink, without prior knowledge of what is in the stream


One potential solution I could think of is to assign a random number to each message (say 1-3 if you want to have a parallelism of 3, or 1-1000 if you want parallelism to be more flexible). However, I wondered if this was the recommended approach as it does not feel very elegant.

Upvotes: 2

Views: 1242

Answers (1)

David Anderson
David Anderson

Reputation: 43707

keyBy is one way to specify stream partitioning, and it is especially useful, since you are guaranteed that all stream elements with the same key will be processed together. This is the basis for stateful stream processing with Flink.

However, if you don't need to use key-partitioned state, and instead care about ensuring that the partitions are well balanced, you can use shuffle() or rebalance() to cause a random or round-robin partitioning. See the docs for more details. You can also implement a custom partitioner, if you want more explicit control.

BTW, if you do want to key the stream by a random number, do not do something like keyBy(new Random.nextInt(n)). It's imperative that the key selector be deterministic. This is necessary because the keys do not travel with the stream records -- instead, the key selector function is used to compute the key whenever it is needed. So for random keying, add another field to your events and populate it with a random number, and use that as the key. This technique is useful when you want to use keyed state or timers, but don't have anything suitable to use as a key.

Upvotes: 6

Related Questions