Reputation: 729
I am using the DataSet API with Flink and I am trying to partition Parquet files by a key in my POJO, e.g. date. The end goal is to write my files out using the following directory structure.
/output/
    20180901/
        file.parquet
    20180902/
        file.parquet
Flink provides a convenience class to wrap AvroParquetOutputFormat, as shown below, but I don't see any way to provide a partitioning key.
HadoopOutputFormat<Void, Pojo> outputFormat =
new HadoopOutputFormat(new AvroParquetOutputFormat(), Job.getInstance());
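For context, a minimal sketch of how this wrapper is typically wired into a DataSet job (it assumes Pojo is an Avro-generated class, so Pojo.getClassSchema() exists; the output path is a placeholder):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

Job job = Job.getInstance();
// Register the Avro schema and a single output directory (no partitioning yet).
AvroParquetOutputFormat.setSchema(job, Pojo.getClassSchema());
FileOutputFormat.setOutputPath(job, new Path("/output"));

HadoopOutputFormat<Void, Pojo> outputFormat =
        new HadoopOutputFormat(new AvroParquetOutputFormat(), job);

// Flink's Hadoop wrapper consumes Tuple2<K, V>, so each POJO is paired with a null key.
// Given a DataSet<Pojo> named "pojos" produced earlier in the job:
pojos.map(new MapFunction<Pojo, Tuple2<Void, Pojo>>() {
        @Override
        public Tuple2<Void, Pojo> map(Pojo value) {
            return new Tuple2<>(null, value);
        }
    })
    .output(outputFormat);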
I'm trying to figure out the best way to proceed. Do I need to write my own version of AvroParquetOutputFormat that extends Hadoop's MultipleOutputs, or can I leverage the Flink APIs to do this for me?
The equivalent in Spark would be:
df.write.partitionBy('date').parquet('base path')
Upvotes: 3
Views: 612
Reputation: 3713
You can use the BucketingSink<T>
sink to write data into partitions you define by supplying an instance of the Bucketer
interface. See DateTimeBucketer for an example.
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/DateTimeBucketer.java
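For illustration, a minimal sketch of a custom Bucketer keyed on the POJO's date field, assuming a DataStream<Pojo> and a getDate() accessor that returns the yyyyMMdd string (both names are assumptions, not from the question):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.hadoop.fs.Path;

public class DateBucketer implements Bucketer<Pojo> {
    @Override
    public Path getBucketPath(Clock clock, Path basePath, Pojo element) {
        // Routes each element to <basePath>/<date>/, e.g. /output/20180901/
        return new Path(basePath, element.getDate());
    }
}

// Usage, given a DataStream<Pojo> named "stream":
BucketingSink<Pojo> sink = new BucketingSink<>("/output");
sink.setBucketer(new DateBucketer());
stream.addSink(sink);

The sink then rolls files per bucket directory, which gives a layout like the one described in the question.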
Upvotes: 0