john

Reputation: 729

Apache Flink - Does DataSet API Support Writing Output to Individual File Partitions

I am using the DataSet API with Flink and I am trying to partition Parquet files by a key in my POJO, e.g. date. The end goal is to write my files out using the following directory structure:

/output/
    20180901/
        file.parquet
    20180902/
        file.parquet

Flink provides a convenience class to wrap AvroParquetOutputFormat, as shown below, but I don't see any way to provide a partitioning key:

HadoopOutputFormat<Void, Pojo> outputFormat =
    new HadoopOutputFormat<>(new AvroParquetOutputFormat(), Job.getInstance());

I'm trying to figure out the best way to proceed. Do I need to write my own version of AvroParquetOutputFormat that extends Hadoop's MultipleOutputs type, or can I leverage the Flink APIs to do this for me?
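One brute-force workaround I can think of is to run one write per distinct date, reusing the wrapper above. Below is only a sketch of that idea: Pojo#getDate() and the Avro schema argument are assumptions about my POJO, and the per-date sinks only run once env.execute() is eventually called.

    import java.util.List;

    import org.apache.avro.Schema;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.parquet.avro.AvroParquetOutputFormat;

    public class PartitionedParquetWriter {

        // Collect the distinct dates, then register one write per date.
        // Pojo#getDate() returning the yyyyMMdd string is an assumption.
        public static void writePartitioned(DataSet<Pojo> data, String basePath, Schema schema)
                throws Exception {
            List<String> dates = data.map(p -> p.getDate()).returns(String.class)
                .distinct()
                .collect();
            for (String date : dates) {
                DataSet<Pojo> partition = data.filter(p -> date.equals(p.getDate()));
                writeSingleDirectory(partition, basePath + "/" + date, schema);
            }
        }

        // Writes one directory per call, reusing the HadoopOutputFormat wrapper from above.
        private static void writeSingleDirectory(DataSet<Pojo> partition, String path, Schema schema)
                throws Exception {
            Job job = Job.getInstance();
            FileOutputFormat.setOutputPath(job, new Path(path));
            AvroParquetOutputFormat.setSchema(job, schema);
            HadoopOutputFormat<Void, Pojo> format =
                    new HadoopOutputFormat<>(new AvroParquetOutputFormat(), job);

            // HadoopOutputFormat consumes Tuple2<K, V>, so pair each record with a null key.
            partition
                .map(new MapFunction<Pojo, Tuple2<Void, Pojo>>() {
                    @Override
                    public Tuple2<Void, Pojo> map(Pojo p) {
                        return new Tuple2<>(null, p);
                    }
                })
                .output(format);
        }
    }

This feels clunky (it scans the data once per date), so I'd prefer something built in.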

The equivalent in Spark would be:

df.write.partitionBy('date').parquet('base path')

Upvotes: 3

Views: 612

Answers (1)

Zack Bartel

Reputation: 3713

You can use the BucketingSink<T> sink to write data into partitions that you define by supplying an instance of the Bucketer interface. See DateTimeBucketer for an example: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/DateTimeBucketer.java
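A minimal sketch of a custom Bucketer keyed on the question's date field might look like the following. It assumes a Pojo#getDate() accessor, and note two caveats: the Bucketer interface moved packages and changed signature across Flink versions, and BucketingSink is a DataStream (streaming) sink rather than a DataSet one.

    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.hadoop.fs.Path;

    // Routes each record into a subdirectory named after its date field,
    // producing the /output/20180901/... layout from the question.
    public class DateFieldBucketer implements Bucketer<Pojo> {
        private static final long serialVersionUID = 1L;

        @Override
        public Path getBucketPath(Clock clock, Path basePath, Pojo element) {
            // Assumes Pojo#getDate() returns the yyyyMMdd string, e.g. "20180901".
            return new Path(basePath, element.getDate());
        }
    }

Wiring it up, with stream being a DataStream<Pojo>:

    BucketingSink<Pojo> sink = new BucketingSink<>("/output");
    sink.setBucketer(new DateFieldBucketer());
    // A Parquet-capable Writer would still need to be configured via setWriter(...).
    stream.addSink(sink);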

Upvotes: 0
