Phil Chae

Reputation: 1136

How to apply a job to only a specific partition using AWS Glue

I have JSON data in an S3 bucket, partitioned on an hourly basis. For example, $bucketname/year=2020/month=07/day=07/hour=01, $bucketname/year=2020/month=07/day=07/hour=02, and so on. I am trying to create a Glue job that transforms the JSON above into Parquet in another S3 bucket.

I want to transform the data hourly (daily would also be fine). However, when I specify the data source in the Glue job script, it has to be the whole dataset mentioned above. My goal is to convert only the data accumulated during a given hour into Parquet, but Glue does not seem to provide this kind of functionality.

The workaround I've thought of is to crawl S3 at the lowest level (e.g. at the $bucketname/year=2020/month=07/day=07/hour=01 level, rather than $bucketname itself). However, this workaround doesn't let me keep hour-based partitioning on the resulting Parquet.
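For the hour-level workaround above, the job input prefix has to be recomputed for each run. A minimal plain-Python sketch (the helper name and bucket are illustrative, not part of any Glue API) that builds the hour-level S3 prefix matching the layout in the question:

```python
from datetime import datetime

def hourly_prefix(bucket, dt):
    # Hypothetical helper: build the hour-level S3 prefix used as the
    # job input, matching the year=/month=/day=/hour= layout above.
    return f"s3://{bucket}/year={dt:%Y}/month={dt:%m}/day={dt:%d}/hour={dt:%H}"

print(hourly_prefix("mybucket", datetime(2020, 7, 7, 1)))
# s3://mybucket/year=2020/month=07/day=07/hour=01
```

A scheduled trigger could pass the previous full hour into such a helper so each run reads exactly one partition.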

Are there any suggestions for achieving my goal? Thanks much in advance.

Upvotes: 1

Views: 802

Answers (1)

Prabhakar Reddy

Reputation: 5124

Glue has a feature called job bookmarks, which processes only the new data that has arrived since the previous run. Refer to the AWS Glue documentation on job bookmarks to learn how you can leverage this to process only the latest data.
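Job bookmarks are enabled in the job properties; the Glue source must be given a `transformation_ctx` argument, and the script must call `job.commit()` so the bookmark state is saved. Conceptually they behave like a stored watermark that each run advances. A plain-Python sketch of that idea (the function and field names are illustrative, not the Glue API):

```python
def filter_new(records, bookmark):
    # Keep only records that arrived after the stored bookmark, then
    # advance the bookmark -- roughly what Glue does per tracked source.
    new = [r for r in records if r["ts"] > bookmark]
    latest = max((r["ts"] for r in new), default=bookmark)
    return new, latest

records = [{"ts": 1, "v": "a"}, {"ts": 2, "v": "b"}, {"ts": 3, "v": "c"}]
batch1, bm = filter_new(records, 0)   # first run: everything is new
batch2, bm = filter_new(records, bm)  # second run: nothing new yet
print(len(batch1), len(batch2))  # 3 0
```

In the real job, the "bookmark" is managed by Glue itself, so a run that follows the hourly schedule picks up only the files written since the previous run.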

Upvotes: 1
