baol

Reputation: 4358

Cache locality in Flink

I have a stream of data containing a key, and I need to mix and match each record with data associated with that key. Each key belongs to a partition, and each partition can be loaded from a database.

The data is quite large: only a few hundred of the hundreds of thousands of partitions can fit in a single task manager.

My current approach is to use partitionCustom based on the key's partition and to cache the partition data inside a RichMapFunction, so I can mix and match without reloading the same partition's data multiple times.
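For reference, a minimal sketch of this setup (the Event, EnrichedEvent and PartitionData types, and their loadFromDb/mixAndMatch methods, are hypothetical placeholders for my own code):

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical placeholder types for illustration.
class Event { int partitionId; int getPartitionId() { return partitionId; } }
class EnrichedEvent { }
class PartitionData {
    static PartitionData loadFromDb(int partitionId) { /* query the database */ return new PartitionData(); }
    EnrichedEvent mixAndMatch(Event e) { return new EnrichedEvent(); }
}

public class PartitionLocalCache {

    // Only a few hundred partitions fit in one task manager.
    static final int MAX_CACHED_PARTITIONS = 200;

    public static DataStream<EnrichedEvent> wire(DataStream<Event> events) {
        return events
            // Route all records of one DB partition to the same subtask,
            // so each subtask only caches the partitions it actually sees.
            .partitionCustom(
                (Partitioner<Integer>) (partitionId, numSubtasks) ->
                    Math.floorMod(partitionId, numSubtasks),
                Event::getPartitionId)
            .map(new RichMapFunction<Event, EnrichedEvent>() {
                // Per-subtask LRU cache of partition data, bounded to what fits in memory.
                private transient Map<Integer, PartitionData> cache;

                @Override
                public void open(Configuration parameters) {
                    cache = new LinkedHashMap<>(MAX_CACHED_PARTITIONS, 0.75f, true) {
                        @Override
                        protected boolean removeEldestEntry(Map.Entry<Integer, PartitionData> eldest) {
                            return size() > MAX_CACHED_PARTITIONS;
                        }
                    };
                }

                @Override
                public EnrichedEvent map(Event event) {
                    PartitionData data = cache.computeIfAbsent(
                        event.getPartitionId(), PartitionData::loadFromDb);
                    return data.mixAndMatch(event);
                }
            });
    }
}
```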

When the message rate on a single partition gets too high, I hit a hot-spot/performance bottleneck.

What tools do I have in Flink to improve the throughput in this case?

Is there a way to customize the scheduling and optimize job placement based on the setup time on each machine and the history of maximum processing times?

Upvotes: 0

Views: 421

Answers (1)

kkrugler

Reputation: 9245

It sounds like (a) your DB-based data is also partitioned, and (b) you have skew in your keys, where one partition gets a lot more keys than other partitions.

Assuming the above is correct, and you've profiled your "mix and match" code to make it reasonably efficient, then you're left with manual optimizations. For example, if you know that keys in partition X are much more common, you can dedicate one parallel subtask to that partition, and then distribute the remaining partitions amongst the other subtasks (see the sketch below).
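As a concrete illustration, a skew-aware Partitioner along those lines might look like the following sketch. HOT_PARTITION_ID is a hypothetical hard-coded example; in practice you'd derive the hot partition(s) from measured message rates:

```java
import org.apache.flink.api.common.functions.Partitioner;

// Sketch: reserve subtask 0 for the known-hot DB partition, and spread all
// other partitions over the remaining subtasks.
public class SkewAwarePartitioner implements Partitioner<Integer> {

    static final int HOT_PARTITION_ID = 42; // hypothetical: the partition you know is hot

    @Override
    public int partition(Integer partitionId, int numSubtasks) {
        if (numSubtasks <= 1) {
            return 0; // nothing to isolate with a single subtask
        }
        if (partitionId == HOT_PARTITION_ID) {
            return 0; // dedicated subtask for the hot partition
        }
        // All other partitions share the remaining numSubtasks - 1 subtasks.
        return 1 + Math.floorMod(partitionId, numSubtasks - 1);
    }
}
```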

Another approach is to add a "batcher" operator, which puts up to N keys for the same partition into a group (typically this also needs a timeout to flush, so data doesn't get stuck). If you can batch enough keys, then it might not be so bad to load the DB data on demand for the partition associated with each batch of keys.
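A sketch of such a batcher, assuming the stream is keyed by the DB partition id and using the same hypothetical Event type plus a hypothetical Batch output type: buffer up to BATCH_SIZE records per partition in keyed state, and register a processing-time timer as the flush timeout.

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class PartitionBatcher extends KeyedProcessFunction<Integer, Event, Batch> {

    static final int BATCH_SIZE = 500;         // max keys per batch (the "N" above)
    static final long FLUSH_TIMEOUT_MS = 5_000; // flush so data doesn't get stuck

    private transient ListState<Event> buffer;
    private transient ValueState<Long> pendingTimer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
            new ListStateDescriptor<>("buffer", Event.class));
        pendingTimer = getRuntimeContext().getState(
            new ValueStateDescriptor<>("pendingTimer", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Batch> out) throws Exception {
        buffer.add(event);
        List<Event> buffered = new ArrayList<>();
        buffer.get().forEach(buffered::add);

        if (buffered.size() >= BATCH_SIZE) {
            flush(ctx.getCurrentKey(), buffered, ctx, out);
        } else if (pendingTimer.value() == null) {
            // First element of a new batch: arm the flush timeout.
            long timer = ctx.timerService().currentProcessingTime() + FLUSH_TIMEOUT_MS;
            ctx.timerService().registerProcessingTimeTimer(timer);
            pendingTimer.update(timer);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Batch> out) throws Exception {
        List<Event> buffered = new ArrayList<>();
        buffer.get().forEach(buffered::add);
        if (!buffered.isEmpty()) {
            flush(ctx.getCurrentKey(), buffered, ctx, out);
        }
    }

    private void flush(Integer partitionId, List<Event> events, Context ctx, Collector<Batch> out) throws Exception {
        // Downstream, one DB load serves the whole batch for this partition.
        out.collect(new Batch(partitionId, events));
        buffer.clear();
        Long timer = pendingTimer.value();
        if (timer != null) {
            ctx.timerService().deleteProcessingTimeTimer(timer);
            pendingTimer.clear();
        }
    }
}

// Hypothetical batch container.
class Batch {
    final int partitionId;
    final List<Event> events;
    Batch(int partitionId, List<Event> events) { this.partitionId = partitionId; this.events = events; }
}
```

You'd wire this up with something like events.keyBy(Event::getPartitionId).process(new PartitionBatcher()), and then do the on-demand DB load once per emitted Batch instead of once per record.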

Upvotes: 1
