Reputation: 794
I have an S3 bucket where my application saves some final result DataFrames as .csv files. I would like to download the latest 1000 files in this bucket, but I don't know how to do it.
I cannot do it manually, as the bucket doesn't allow me to sort the files by date because it has more than 1000 elements.
I've seen some questions that could be solved using the AWS CLI, but I don't have enough user permissions to use the AWS CLI, so I have to do it with a boto3
Python script that I'm going to upload into a Lambda.
How can I do this?
Upvotes: 1
Views: 817
Reputation: 121
If your application uploads files periodically, you could try this:

import boto3
import datetime

last_n_days = 250

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='processed')

# LastModified is timezone-aware, so compare against an aware datetime
date_limit = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=last_n_days)

for page in pages:
    # 'Contents' is absent when the prefix matches no objects
    for obj in page.get('Contents', []):
        if obj['LastModified'] >= date_limit and obj['Key'][-1] != '/':
            s3.download_file('bucket', obj['Key'], obj['Key'].split('/')[-1])

With the script above, all files modified in the last 250 days will be downloaded. If your application uploads roughly 4 files per day, this should cover the latest 1000 files.
Upvotes: 2
Reputation: 4486
The best solution is to redefine your problem: rather than retrieving the N most recent files, retrieve all files from the N most recent days. I think that you'll find this to be a better solution in most cases.
However, to make it work you'll need to adopt some form of date-stamped prefix for the uploaded files. For example, 2021-04-16/myfile.csv.
If you feel that you must retrieve N files, then you can use the prefix to retrieve only a portion of the list. Assuming that you know that you have approximately 100 files uploaded per day, then start your bucket listing with 2021-04-05/.
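Assuming keys are laid out under such date-stamped prefixes, the approach can be sketched with boto3 like this (the bucket name, function names, and the no-credentials test helper are illustrative, not from the original answer):

```python
import datetime

def day_prefixes(n_days, today=None):
    # Build the date-stamped prefixes (YYYY-MM-DD/) for the N most recent days.
    today = today or datetime.date.today()
    return [(today - datetime.timedelta(days=i)).isoformat() + '/'
            for i in range(n_days)]

def download_recent_days(bucket, n_days):
    # boto3 is imported here so day_prefixes stays testable without AWS access.
    import boto3
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for prefix in day_prefixes(n_days):
        # Each paginate call lists only one day's uploads, not the whole bucket.
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            # 'Contents' is absent when a prefix matches no objects.
            for obj in page.get('Contents', []):
                if not obj['Key'].endswith('/'):
                    s3.download_file(bucket, obj['Key'], obj['Key'].split('/')[-1])
```

For example, at roughly 100 uploads per day, something like download_recent_days('my-bucket', 10) would fetch approximately the latest 1000 files.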
Upvotes: 0