AvinashK
AvinashK

Reputation: 3423

Global sort in Flink 1.14 with Table and Datastream APIs

I have been looking at different global data sorting options for bounded data in Flink 1.14. I found that many questions about this on Stackoverflow and other sites are quite a few years old, about deprecated APIs or don't answer this question fully. As Flink is rapidly developing, I wanted to ask about the options available in the latest stable Flink (1.14).

Following is how I understand the current scenario (which may be wrong). My questions are also attached. Flink has two APIs - DataStream and Table - which may run in batch or streaming execution modes. The DataSet API is deprecated.

Batch Execution

Streaming Execution

Overall, I am unable to find a truly parallel implementations for sorting bounded datasets in Flink. Am I correct in my findings above?

Upvotes: 0

Views: 355

Answers (1)

David Anderson
David Anderson

Reputation: 43439

Given how Flink is organized, for batch I think the best that's possible will be to sort partitions of the data and then merge those sorted partitions. That final step cannot be done in parallel. I don't know if the Table/SQL API does anything like this automatically, but I suspect it might, after having taken a quick look at the source code.

You could ask about this on the flink user mailing list (https://flink.apache.org/community.html#how-to-subscribe-to-a-mailing-list).

For more insight on how batch workloads are executed by the SQL planner and how to tune them, I recommend https://flink.apache.org/2021/10/26/sort-shuffle-part1.html.

Upvotes: 1

Related Questions