Reputation: 306
I am trying to process some files on HDFS and write the results back to HDFS as well. The files are prepared before the job starts. The catch is that I want to write to different paths and files depending on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, the DataSet API does not seem to have a similar facility. I have found some Q&As on Stack Overflow (1, 2, 3). As I see it, I have two options:

- Hadoop's MultipleTextOutputFormat or MultipleOutputs;
- switching to streaming so I can use BucketingSink.

My question is how to choose between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.
Upvotes: 0
Views: 281
Reputation: 191
We faced the same problem. We too were surprised that DataSet does not support addSink().

I recommend not switching to streaming mode: you might give up some optimizations (e.g. memory pools) that are only available in batch mode.

You may have to implement your own output format to do the bucketing. Instead of switching, you can extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close the output streams.
That's what we did and it's working great.
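Flink specifics aside, the open-on-demand stream handling that such a custom output format performs can be sketched with plain Java I/O. This is a minimal sketch, not Flink code: the class name, bucket names, and the string-based bucket assigner are all hypothetical, and a real implementation would put this logic inside open()/writeRecord()/close() of a RichOutputFormat and write to HDFS paths instead of the local filesystem.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the per-bucket stream management a custom
// output format would do. Flink types are omitted so it runs standalone.
class BucketingWriter implements Closeable {
    private final Path baseDir;
    // Plays the role of BucketAssigner: maps a record to a bucket id.
    private final Function<String, String> bucketAssigner;
    private final Map<String, BufferedWriter> openBuckets = new HashMap<>();

    BucketingWriter(Path baseDir, Function<String, String> bucketAssigner) {
        this.baseDir = baseDir;
        this.bucketAssigner = bucketAssigner;
    }

    // Analogous to writeRecord(): route each record to its bucket file,
    // opening the bucket's stream lazily on first use.
    void write(String record) throws IOException {
        String bucket = bucketAssigner.apply(record);
        BufferedWriter w = openBuckets.get(bucket);
        if (w == null) {
            w = Files.newBufferedWriter(baseDir.resolve(bucket + ".txt"));
            openBuckets.put(bucket, w);
        }
        w.write(record);
        w.newLine();
    }

    // Analogous to close(): flush and close every open bucket stream.
    @Override
    public void close() throws IOException {
        for (BufferedWriter w : openBuckets.values()) {
            w.close();
        }
        openBuckets.clear();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("buckets");
        try (BucketingWriter writer =
                 new BucketingWriter(dir, r -> r.startsWith("ERROR") ? "errors" : "info")) {
            writer.write("ERROR disk full");
            writer.write("INFO started");
            writer.write("ERROR timeout");
        }
        System.out.println(Files.readAllLines(dir.resolve("errors.txt")).size()); // prints 2
    }
}
```

In a real RichOutputFormat you would also have access to the runtime context (e.g. the subtask index) in open(), which is useful for making per-parallel-instance file names unique.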
I hope Flink supports this in batch mode soon.
Upvotes: 1