largecats

Reputation: 175

Spark partitionBy equivalent in Flink?

I'm looking to write streaming data into separate folders partitioned by key, like Spark's partitionBy (sketched below, after the expected output).

Data:

key1 1
key2 1
key1 2
key3 1
...

Expected output:

/output_directory/key1
+-- part-0-0
+-- part-0-1
+-- ...
/output_directory/key2
+-- part-0-0
+-- part-0-1
+-- ...
/output_directory/key3
+-- part-0-0
+-- part-0-1
+-- ...
...
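
For reference, the Spark behavior I mean is roughly the following (a sketch with hypothetical paths and column names; note that Spark's partitionBy actually names the sub-directories key=key1 rather than key1):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkPartitionBy {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("partitionBy").getOrCreate();

        // Hypothetical input with space-separated "key value" lines.
        Dataset<Row> df = spark.read()
                .option("sep", " ")
                .csv("/input_directory")
                .toDF("key", "value");

        // One sub-directory per distinct key, e.g. /output_directory/key=key1/part-...
        df.write().mode(SaveMode.Overwrite).partitionBy("key").csv("/output_directory");
    }
}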

(I've read about stream splitting and custom sinks in Flink, but stream splitting doesn't seem suited to cases where the keys are not known in advance, and I'd like to see whether there are other options before resorting to a custom sink.)

Are there built-in methods or third-party libraries to achieve this in Flink?

Upvotes: 0

Views: 145

Answers (1)

David Anderson

Reputation: 43499

Take a look at Flink's FileSink, which you will want to use with a BucketAssigner that assigns stream elements to buckets based on their key.

For more details, see https://ci.apache.org/projects/flink/flink-docs-stable/docs/connectors/datastream/file_sink/.
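
Here's a minimal sketch of that approach, assuming Flink 1.12+ and elements arriving as Tuple2<String, Integer> (the job class and the in-line source are hypothetical stand-ins for your real pipeline):

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class PartitionByKeyJob {

    // Routes each element into a sub-directory named after its key,
    // e.g. /output_directory/key1, /output_directory/key2, ...
    public static class KeyBucketAssigner
            implements BucketAssigner<Tuple2<String, Integer>, String> {

        @Override
        public String getBucketId(Tuple2<String, Integer> element, Context context) {
            return element.f0;  // the key becomes the bucket (folder) name
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Row-format part files are finalized on checkpoints, so enable checkpointing.
        env.enableCheckpointing(60_000);

        // Hypothetical source standing in for the real stream of (key, value) pairs.
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("key1", 1),
                Tuple2.of("key2", 1),
                Tuple2.of("key1", 2),
                Tuple2.of("key3", 1));

        FileSink<Tuple2<String, Integer>> sink = FileSink
                .forRowFormat(new Path("/output_directory"),
                        new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
                .withBucketAssigner(new KeyBucketAssigner())
                .build();

        input.sinkTo(sink);
        env.execute("partition-by-key");
    }
}

Note that with a row-format FileSink, in-progress part files only roll over to the finished state when a checkpoint completes, so the output won't become visible unless checkpointing is enabled.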

Upvotes: 1
