iamabug

Reputation: 306

How to write to different files based on content for batch processing in Flink?

I am trying to process some files on HDFS and write the results back to HDFS as well. The files are prepared before the job starts. The catch is that I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, DataSet does not seem to have a similar API. I have found some Q&As on Stack Overflow (1, 2, 3). As far as I can tell, I have two options:

  1. Use the Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
  2. Read the files as a stream and use BucketingSink (a rough sketch of this option follows below).

My question is: how should I choose between them, or is there another solution? Any help is appreciated.
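For concreteness, option 2 would look roughly like the sketch below (Flink 1.x with the flink-connector-filesystem dependency). The input/output paths and the rule that the first comma-separated field picks the bucket are made up for illustration. One caveat I am aware of: BucketingSink relies on checkpointing to finalize files, so a bounded job may leave files in an in-progress state.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.Clock
import org.apache.flink.streaming.connectors.fs.bucketing.{Bucketer, BucketingSink}
import org.apache.hadoop.fs.Path

object StreamBucketingJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Read the already-prepared HDFS files as a (bounded) stream.
    val lines = env.readTextFile("hdfs:///input")

    val sink = new BucketingSink[String]("hdfs:///output")
    sink.setBucketer(new Bucketer[String] {
      // Route each record to a sub-directory derived from its content;
      // here the first comma-separated field is the bucket (an assumption).
      override def getBucketPath(clock: Clock, basePath: Path, element: String): Path =
        new Path(basePath, element.split(",")(0))
    })

    lines.addSink(sink)
    env.execute("bucket-by-content")
  }
}
```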

EDIT: This question may be a duplicate of this.

Upvotes: 0

Views: 281

Answers (1)

Bon Speedy

Reputation: 191

We faced the same problem. We too are surprised that DataSet does not support addSink().

I recommend not switching to streaming mode: you might give up some optimizations (e.g. memory pools) that are available in batch mode.

Instead, you can implement your own OutputFormat to do the bucketing: extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close the output streams.

That's what we did and it's working great.
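For illustration, here is a minimal sketch of that idea in Scala. The base path, the bucket-id function, and the line serializer are hypothetical parameters, and a plain `T => String` function stands in for the BucketAssigner interface to keep the example self-contained; a real version would also need flushing and failure cleanup.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable

import org.apache.flink.api.common.io.RichOutputFormat
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class BucketingOutputFormat[T](
    basePath: String,
    bucketId: T => String,   // derives a sub-directory from the record content
    serialize: T => String   // turns one record into a line of text
) extends RichOutputFormat[T] {

  // One open writer per bucket, created lazily on the first matching record.
  @transient private var writers: mutable.Map[String, BufferedWriter] = _
  @transient private var fs: FileSystem = _
  private var subtask: Int = _

  override def configure(parameters: Configuration): Unit = ()

  override def open(taskNumber: Int, numTasks: Int): Unit = {
    subtask = taskNumber
    writers = mutable.Map.empty
    fs = FileSystem.get(new org.apache.hadoop.conf.Configuration())
  }

  override def writeRecord(record: T): Unit = {
    val bucket = bucketId(record)
    val writer = writers.getOrElseUpdate(bucket, {
      // One file per bucket and parallel subtask: <base>/<bucket>/part-<n>
      val path = new Path(s"$basePath/$bucket/part-$subtask")
      new BufferedWriter(new OutputStreamWriter(fs.create(path, true)))
    })
    writer.write(serialize(record))
    writer.newLine()
  }

  override def close(): Unit = {
    if (writers != null) writers.values.foreach(_.close())
  }
}
```

It can then be plugged into the batch job with something like `ds.output(new BucketingOutputFormat[String]("hdfs:///output", _.split(",")(0), identity))`, assuming the first comma-separated field decides the bucket.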

I hope Flink will support this in batch mode soon.

Upvotes: 1
