Artem Tikhomirov

Reputation: 21676

Apache Flink: Reading parquet files from S3 in Data Stream APIs

We have several external jobs that produce small (500 MiB) Parquet objects on S3, partitioned by time. The goal is to build an application that reads those files, joins them on a specific key, and writes the result to a Kinesis stream or another S3 bucket.

Can this be achieved with Flink alone? Can Flink monitor for new S3 objects as they are created and load them into the application?

Upvotes: 0

Views: 1178

Answers (1)

kkrugler

Reputation: 9245

The newer FileSource class (available in recent Flink versions) supports monitoring a directory for new or modified files. See FileSource.forBulkFileFormat() in particular for reading Parquet files.

You use the FileSourceBuilder returned by the above method call, and then call .monitorContinuously(Duration.ofHours(1)) (or whatever interval makes sense).
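A minimal sketch of what this could look like, assuming Flink 1.15+ with the flink-connector-files and flink-parquet dependencies on the classpath. It uses FileSource.forRecordStreamFormat() with AvroParquetReaders as one convenient way to read Parquet records (forBulkFileFormat() with a columnar input format is the other option). The bucket path and the schema fields (userId, eventTime) are hypothetical placeholders for your actual data:

```java
import java.time.Duration;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParquetS3Monitor {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical Avro schema matching the fields in the Parquet files.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("userId")
                .requiredLong("eventTime")
                .endRecord();

        // Build a FileSource that re-scans the S3 prefix for new objects
        // every hour; without monitorContinuously() it reads once and stops.
        FileSource<GenericRecord> source = FileSource
                .forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema),
                        new Path("s3://my-bucket/events/")) // hypothetical bucket
                .monitorContinuously(Duration.ofHours(1))
                .build();

        DataStream<GenericRecord> events = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "parquet-s3-source");

        events.print();
        env.execute("Monitor Parquet on S3");
    }
}
```

Note that running this unbounded (monitored) source puts the job in streaming mode, so a downstream join on your key would typically be a windowed or interval join, with the result written out via a Kinesis or file sink.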

Upvotes: 1
