Harvesting Data

Reputation: 170

Pick the most recent file from AWS S3 for every file type using Python

I have a scenario where multiple files are present in an AWS S3 bucket. I need to be able to pick the most recent file for every file type based on its last modified date. I could also use the numeric part of the file name, as it indicates the hour_yearmonthday when the file was created.

The following two files need to be picked, as they were the most recently modified: File_A_02_20220728.csv and File_B_02_20220728.csv. Any suggestions or snippets on how to do this would be much appreciated.

s3://bucket/File_A_00_20220728.csv
s3://bucket/File_A_01_20220728.csv
s3://bucket/File_A_02_20220728.csv 
s3://bucket/File_B_00_20220728.csv
s3://bucket/File_B_01_20220728.csv
s3://bucket/File_B_02_20220728.csv

Upvotes: 0

Views: 2184

Answers (1)

John Rotenstein

Reputation: 269282

There is no built-in function in Amazon S3 that does this for you.

You would need to use list_objects_v2() to list the contents of the bucket (each call returns at most 1,000 objects, so larger buckets need pagination). Then, use Python logic/lists/dictionaries to identify the files you want. I would recommend:

  • From the result set, create a list of extensions
  • Loop through each extension, finding the latest object for that extension
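The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation. Because every key in the question ends in .csv, it groups on the part of the name before the hour and date (File_A, File_B) rather than on the extension. The naming pattern, the helper names (file_type, latest_per_type, list_bucket) and the bucket name are assumptions made for illustration.

```python
import re

# Assumed naming pattern: <type>_<HH>_<YYYYMMDD>.<ext>, e.g. File_A_02_20220728.csv
KEY_PATTERN = re.compile(r"^(?P<type>.+)_\d{2}_\d{8}\.\w+$")


def file_type(key):
    """Return the file-type part of a key, or None if the key does not match."""
    name = key.rsplit("/", 1)[-1]  # drop any folder prefix
    match = KEY_PATTERN.match(name)
    return match.group("type") if match else None


def latest_per_type(objects):
    """Map each file type to the object with the newest LastModified.

    `objects` is an iterable of dicts shaped like the entries in the
    "Contents" list returned by list_objects_v2 ("Key", "LastModified").
    """
    latest = {}
    for obj in objects:
        ftype = file_type(obj["Key"])
        if ftype is None:
            continue  # skip keys that do not follow the assumed pattern
        current = latest.get(ftype)
        if current is None or obj["LastModified"] > current["LastModified"]:
            latest[ftype] = obj
    return latest


def list_bucket(bucket, prefix=""):
    """Yield every object in the bucket, following pagination."""
    import boto3  # imported here so the grouping logic works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        yield from page.get("Contents", [])
```

With credentials configured, latest_per_type(list_bucket("bucket")) would return a dictionary keyed by file type, and each value's "Key" is the object to download.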

For an example of grouping by extension, see: Search S3 bucket for file extension and size

For an example of selecting the 'latest' object, see: How to get the latest file of an S3 bucket using Boto3?
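If the timestamp embedded in the key is preferred over LastModified, which the question mentions as an option, the 'latest' object could be chosen from the names alone. The sketch below assumes every key follows the <type>_<HH>_<YYYYMMDD>.<ext> pattern seen in the question; parse_key and latest_key_per_type are hypothetical helper names, not part of boto3.

```python
import re
from datetime import datetime

# Assumed naming pattern: <type>_<HH>_<YYYYMMDD>.<ext>
NAME_PATTERN = re.compile(r"^(?P<type>.+)_(?P<hour>\d{2})_(?P<date>\d{8})\.\w+$")


def parse_key(key):
    """Return (file_type, timestamp) for a key, or None if it cannot be parsed."""
    name = key.rsplit("/", 1)[-1]  # drop any folder prefix
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group("date") + match.group("hour"), "%Y%m%d%H")
    except ValueError:
        return None  # digits present but not a valid date or hour
    return match.group("type"), stamp


def latest_key_per_type(keys):
    """Map each file type to the key carrying the newest embedded timestamp."""
    latest = {}
    for key in keys:
        parsed = parse_key(key)
        if parsed is None:
            continue
        ftype, stamp = parsed
        if ftype not in latest or stamp > latest[ftype][0]:
            latest[ftype] = (stamp, key)
    return {ftype: key for ftype, (stamp, key) in latest.items()}
```

This variant needs only the key strings, so it also works on a plain list of names; it is only as reliable as the naming convention itself.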

Upvotes: 3
