Reputation: 21676
We have several external jobs producing small (~500 MiB) Parquet objects on S3, partitioned by time. The goal is to build an application that reads those files, joins them on a specific key, and writes the result to a Kinesis stream or another S3 bucket.
Can this be achieved with Flink alone? Can Flink monitor for new S3 objects as they are created and load them into the application?
Upvotes: 0
Views: 1178
Reputation: 9245
The newer `FileSource` class (available in recent Flink versions) supports monitoring a directory for new/modified files. See `FileSource.forBulkFileFormat()` in particular, for reading Parquet files.
You use the `FileSourceBuilder` returned by the above method call, then call `.monitorContinuously(Duration.ofHours(1))` (or whatever interval makes sense) before building the source.
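A rough sketch of what that can look like, assuming Flink's Parquet format module (`flink-parquet`) is on the classpath; the bucket path, schema, and batch size are placeholders, and the exact `ParquetColumnarRowInputFormat` constructor arguments vary between Flink versions:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;

public class S3ParquetMonitorJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder schema for the Parquet files: (id INT, payload STRING).
        RowType rowType = RowType.of(new IntType(), new VarCharType(VarCharType.MAX_LENGTH));

        // Bulk format for reading Parquet; constructor signature differs by Flink version.
        ParquetColumnarRowInputFormat<FileSourceSplit> parquetFormat =
                new ParquetColumnarRowInputFormat<>(
                        new org.apache.hadoop.conf.Configuration(), // Hadoop config (S3 settings, etc.)
                        rowType,
                        InternalTypeInfo.of(rowType),
                        500,    // rows per read batch
                        false,  // interpret timestamps as UTC
                        true);  // case-sensitive field matching

        // Point at the time-partitioned prefix and re-scan it on an interval.
        FileSource<RowData> source =
                FileSource.forBulkFileFormat(parquetFormat, new Path("s3://my-bucket/events/"))
                        .monitorContinuously(Duration.ofHours(1))
                        .build();

        DataStream<RowData> rows =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-s3-source");

        // ...join with the other stream on the key, then sink to Kinesis or S3...
        env.execute("s3-parquet-monitor");
    }
}
```

With `monitorContinuously` the source runs in unbounded (streaming) mode, so already-processed files are tracked in checkpointed state and only newly appearing objects are read on each scan.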
Upvotes: 1