john

Reputation: 729

Apache Flink - Does DataSet API Support Writing Output to Individual File Partitions

I am using the DataSet API with Flink and I am trying to partition Parquet files by a key in my POJO, e.g. date. The end goal is to write my files out using the following directory structure:

/output/
    20180901/
        file.parquet
    20180902/
        file.parquet

Flink provides a convenience class to wrap AvroParquetOutputFormat, as shown below, but I don't see any way to provide a partitioning key:

HadoopOutputFormat<Void, Pojo> outputFormat =
    new HadoopOutputFormat<>(new AvroParquetOutputFormat(), Job.getInstance());

I'm trying to figure out the best way to proceed. Do I need to write my own version of AvroParquetOutputFormat that extends Hadoop's MultipleOutputs type, or can I leverage the Flink APIs to do this for me?
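One brute-force workaround I can think of is to run one write per distinct date, reusing the wrapper above. Below is only a sketch of that idea: Pojo#getDate() and the Avro schema argument are assumptions about my POJO, and the per-date sinks only run once env.execute() is eventually called.

    import java.util.List;

    import org.apache.avro.Schema;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.parquet.avro.AvroParquetOutputFormat;

    public class PartitionedParquetWriter {

        // Collect the distinct dates, then register one write per date.
        // Pojo#getDate() returning the yyyyMMdd string is an assumption.
        public static void writePartitioned(DataSet<Pojo> data, String basePath, Schema schema)
                throws Exception {
            List<String> dates = data.map(p -> p.getDate()).returns(String.class)
                .distinct()
                .collect();
            for (String date : dates) {
                DataSet<Pojo> partition = data.filter(p -> date.equals(p.getDate()));
                writeSingleDirectory(partition, basePath + "/" + date, schema);
            }
        }

        // Writes one directory per call, reusing the HadoopOutputFormat wrapper from above.
        private static void writeSingleDirectory(DataSet<Pojo> partition, String path, Schema schema)
                throws Exception {
            Job job = Job.getInstance();
            FileOutputFormat.setOutputPath(job, new Path(path));
            AvroParquetOutputFormat.setSchema(job, schema);
            HadoopOutputFormat<Void, Pojo> format =
                    new HadoopOutputFormat<>(new AvroParquetOutputFormat(), job);

            // HadoopOutputFormat consumes Tuple2<K, V>, so pair each record with a null key.
            partition
                .map(new MapFunction<Pojo, Tuple2<Void, Pojo>>() {
                    @Override
                    public Tuple2<Void, Pojo> map(Pojo p) {
                        return new Tuple2<>(null, p);
                    }
                })
                .output(format);
        }
    }

This feels clunky (it scans the data once per date), so I'd prefer something built in.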

The equivalent in Spark would be:

df.write.partitionBy('date').parquet('base path')

Upvotes: 3

Views: 612

Answers (1)

Zack Bartel

Reputation: 3713

You can use the BucketingSink<T> sink to write data into partitions that you define by supplying an instance of the Bucketer interface. See DateTimeBucketer for an example: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/DateTimeBucketer.java
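A minimal sketch of a custom Bucketer keyed on the question's date field might look like the following. It assumes a Pojo#getDate() accessor, and note two caveats: the Bucketer interface moved packages and changed signature across Flink versions, and BucketingSink is a DataStream (streaming) sink rather than a DataSet one.

    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.hadoop.fs.Path;

    // Routes each record into a subdirectory named after its date field,
    // producing the /output/20180901/... layout from the question.
    public class DateFieldBucketer implements Bucketer<Pojo> {
        private static final long serialVersionUID = 1L;

        @Override
        public Path getBucketPath(Clock clock, Path basePath, Pojo element) {
            // Assumes Pojo#getDate() returns the yyyyMMdd string, e.g. "20180901".
            return new Path(basePath, element.getDate());
        }
    }

Wiring it up, with stream being a DataStream<Pojo>:

    BucketingSink<Pojo> sink = new BucketingSink<>("/output");
    sink.setBucketer(new DateFieldBucketer());
    // A Parquet-capable Writer would still need to be configured via setWriter(...).
    stream.addSink(sink);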

Upvotes: 0
