How to preserve order of records when implementing an ETL job with Flink?

Question

Suppose I want to implement an ETL job with Flink, source and sink of which are both Kafka topic with only one partition.
Order of records in source and sink matters to downstream(There are more jobs consume sink of my ETL, jobs are maintained by other teams.).
Is there any way make sure order of records in sink same as source, and make parallelism more than 1?

David Anderson · Accepted Answer

https://stackoverflow.com/a/69094404/2000823 covers parts of your question. The basic principle is that two events will maintain their relative ordering so long as they take the same path through the execution graph. Otherwise, the events will race against each other, and there is no guarantee regarding ordering.

If your job only has FORWARD connections between the tasks, then the order will always be preserved. If you use keyBy or rebalance (to change the parallel), then it will not.

A Kafka topic with one partition cannot be read from (or written to) in parallel. You can increase the parallelism of the job, but this will only have a meaningful effect on intermediate tasks (since in this case the source and sink cannot operate in parallel) -- which then introduces the possibility of events ending up out-of-order.

If it's enough to maintain the ordering on a key-by-key basis, then with just one partition, you'll always be fine. With multiple partitions being consumed in parallel, then if you use keyBy (or GROUP BY in SQL), you'll be okay only if all events for a key are always in the same Kafka partition.

How to preserve order of records when implementing an ETL job with Flink?

Answers (1)

Related Questions