Reputation: 13
To make deployment and management of our Flink workload easier, we would like to run multiple smaller jobs rather than one large job that does everything. The problem is that each of these smaller jobs then has to read and deserialize the same input data from Kafka.
Our performance testing shows that running multiple jobs, each reading the input data independently, uses more resources and takes longer than processing the same data in a single job.
Is there a way to read the input data once and then run multiple jobs that do only the processing steps, or at least a way to reduce the overhead of running multiple jobs?
Upvotes: 1
Views: 82
Reputation: 43697
I think you have to expect to pay some price for this, but with some care you should be able to minimize the cost.
A few ideas:
(1) Use a serializer with good performance, e.g., protobuf; see the graph toward the end of this blog post. (A minimal sketch of a protobuf Kafka deserializer follows this list.)
(2) Structure things so that you can leverage reinterpretAsKeyedStream to avoid unnecessary keyBys when you re-ingest previously keyed datastreams (see the second sketch after this list).
(3) You might also find "Watermark propagation with Sink API" interesting, since it relates to this topic.
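For (1), here is a minimal sketch of reading protobuf-encoded records with the KafkaSource connector. It assumes a hypothetical protobuf-generated class `Event`; the broker address, topic, and group id are placeholders, not anything from the question.

```java
import java.io.IOException;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ProtobufKafkaJob {

    // Deserializer for the hypothetical protobuf-generated class `Event`.
    public static class EventDeserializer extends AbstractDeserializationSchema<Event> {
        @Override
        public Event deserialize(byte[] message) throws IOException {
            // Protobuf parsing is typically much cheaper than JSON parsing.
            return Event.parseFrom(message);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Broker, topic, and group id below are placeholders.
        KafkaSource<Event> source = KafkaSource.<Event>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setGroupId("processing-job")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new EventDeserializer())
                .build();

        DataStream<Event> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events");

        events.print();
        env.execute("protobuf-kafka-job");
    }
}
```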
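For (2), a minimal sketch of DataStreamUtils.reinterpretAsKeyedStream, again assuming a hypothetical `Event` class with a `getUserId()` accessor. The crucial caveat is that the incoming stream must already be partitioned exactly as keyBy on the same key selector would partition it; if the data passes between jobs through Kafka, that property has to be arranged deliberately.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class ReinterpretExample {

    // `events` was keyed by user id in the upstream job and written out in a way
    // that preserves that partitioning; getUserId() is the same key extractor
    // the upstream job used (both are assumptions for this sketch).
    public static KeyedStream<Event, String> rekeyWithoutShuffle(DataStream<Event> events) {
        KeySelector<Event, String> byUser = Event::getUserId;

        // Precondition: every record must already sit on the subtask that
        // keyBy(byUser) would route it to; otherwise keyed state will be wrong.
        return DataStreamUtils.reinterpretAsKeyedStream(events, byUser);
    }
}
```

The returned KeyedStream can then drive keyed state, windows, and timers in the downstream job without paying for another network shuffle.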
Upvotes: 1