Shenjiaqi

Reputation: 11

How can I sort data by record fields in batch execution mode using the Flink DataStream API?

I need to write a batch Flink job, and I'd prefer to use the DataStream API.

I want the output of the job to be sorted within each output file (HDFS file), so that downstream consumers can rely on this ordering to speed up indexing.

Is there any replacement for DataSet#sortPartition in the DataStream API?

I read in the FLIP doc that in batch execution mode the records in a KeyedStream are sorted by the binary representation of the key. Can I use this feature, say by making a String key composed of the fields I want to sort by?
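One caveat with a composite String key (a hypothetical illustration, not from the thread): the lexicographic/binary ordering of a String key only matches numeric ordering if numeric fields are zero-padded to a fixed width. A sketch, with assumed field names:

```java
public class CompositeKeys {
    // Hypothetical helper: build a composite sort key from a numeric field
    // and a String field. The numeric field is zero-padded to 19 digits
    // (enough for any non-negative long) so that lexicographic order of
    // the key matches numeric order of the field.
    static String compositeKey(long userId, String name) {
        return String.format("%019d|%s", userId, name);
    }
}
```

Without the padding, "10|x" would sort before "2|x", which is probably not what you want.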

Upvotes: 0

Views: 308

Answers (1)

kkrugler

Reputation: 9245

Sorting is inherently a batch operation, as you first have to collect all the data so that it can be sorted (assuming no ordering on the input data).

I believe Flink SQL supports doing an ORDER BY, but I haven't tried that.
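For what it's worth, in batch runtime mode a Flink SQL ORDER BY might look something like this (a sketch only; the table name `records` and the sort columns are assumptions, and this isn't tested):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class SortWithSql {
    public static void main(String[] args) {
        // Batch mode is required for a global ORDER BY on arbitrary columns.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inBatchMode());

        // Assumes a table `records` has already been registered, with
        // hypothetical columns field_a and field_b to sort by.
        Table sorted = tEnv.sqlQuery(
                "SELECT * FROM records ORDER BY field_a, field_b");

        // sorted can then be written out with an INSERT INTO / sink table.
    }
}
```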

If you're writing straight-up Java code, I assume you'd have to create a custom KeyedProcessFunction that saves all incoming records into, say, MapState, using the sort field(s) as the keys; then at the end of the input you'd sort the keys and finally emit everything. But I've never tried that. And David Anderson should weigh in here :)
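A rough sketch of that idea, using ListState plus an end-of-input timer instead of MapState (in batch execution mode the final MAX_WATERMARK fires event-time timers once all input for a key has been processed). The `Record` type and its `getSortField` getter are assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: buffer all records for a key, then sort and emit
// them when the end-of-input timer fires (batch execution mode only).
public class SortPerKey extends KeyedProcessFunction<String, Record, Record> {

    private transient ListState<Record> buffer;

    @Override
    public void open(Configuration conf) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Record.class));
    }

    @Override
    public void processElement(Record r, Context ctx, Collector<Record> out)
            throws Exception {
        buffer.add(r);
        // In batch mode this fires exactly once per key, at end of input.
        ctx.timerService().registerEventTimeTimer(Long.MAX_VALUE);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Record> out)
            throws Exception {
        List<Record> records = new ArrayList<>();
        buffer.get().forEach(records::add);
        // getSortField is an assumed getter on the (assumed) Record type.
        records.sort(Comparator.comparing(Record::getSortField));
        records.forEach(out::collect);
        buffer.clear();
    }
}
```

Note this sorts within each key, so you'd still need the keys themselves to arrive in order (or be part of the sort) to get fully sorted output files.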

Upvotes: 1
