Reputation: 11
I need to write a batch Flink job, and I'd prefer to use the DataStream API.
I want the output of the job to be sorted within each output file (HDFS file), so that downstream consumers can rely on that ordering to speed up indexing.
Is there a replacement for DataSet#sortPartition in the DataStream API?
I read in the FLIP doc that data in a KeyedStream is sorted by the key's binary representation. Can I use this feature, say by building a String key composed of the fields I want to sort by?
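To make the composite-key idea concrete, here is a minimal plain-Java sketch of the kind of key I have in mind. The class name CompositeSortKey and the two fields (a non-negative long timestamp and a name) are hypothetical. It only shows that zero-padding makes String comparison agree with numeric order; whether Flink's binary key ordering matches String.compareTo is exactly the part I am unsure about.

```java
import java.util.Locale;

final class CompositeSortKey {
    // Builds a key whose String ordering follows (timestamp, name).
    // Assumes timestamp >= 0; 19 digits is enough for Long.MAX_VALUE.
    static String of(long timestamp, String name) {
        return String.format(Locale.ROOT, "%019d|%s", timestamp, name);
    }

    public static void main(String[] args) {
        // Without padding, "10" sorts before "2" as a String.
        System.out.println("10".compareTo("2") < 0);
        // With padding, the key for 2 sorts before the key for 10.
        System.out.println(of(2, "b").compareTo(of(10, "a")) < 0);
    }
}
```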
Upvotes: 0
Views: 308
Reputation: 9245
Sorting is inherently a batch operation, as you first have to collect all the data before it can be sorted (assuming the input data has no existing ordering).
I believe Flink SQL would support doing an ORDER BY, but I haven't tried that.
If you're writing straight-up Java code, I assume you'd have to create a custom KeyedProcessFunction that saves all incoming records into, say, MapState using the sort field(s) as the keys; then in the close() method you'd sort the keys and finally emit everything. But I've never tried that. And David Anderson should weigh in here :)
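The buffer-then-sort idea can be sketched in plain Java, without any Flink dependency. This is an illustration under assumptions, not a tested Flink operator: the class SortOnClose and its methods are hypothetical, a HashMap stands in for MapState, and drainSorted() stands in for whatever end-of-input hook the job ends up using.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortOnClose {
    // Stand-in for MapState: sort key -> records that share that key.
    // Like a hash-backed state, it gives no ordering guarantee on iteration.
    private final Map<String, List<String>> buffer = new HashMap<>();

    // Called once per incoming record: just buffer it under its sort key.
    void add(String sortKey, String record) {
        buffer.computeIfAbsent(sortKey, k -> new ArrayList<>()).add(record);
    }

    // Called once at end of input: sort the keys, then emit records in key order.
    // Records with the same key keep their arrival order.
    List<String> drainSorted() {
        List<String> keys = new ArrayList<>(buffer.keySet());
        Collections.sort(keys);
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.addAll(buffer.get(key));
        }
        buffer.clear();
        return out;
    }
}
```

Everything is held in memory until the end, so this only makes sense when each partition's data fits in the available state.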
Upvotes: 1