Krzysztof Słowiński

Reputation: 7227

Get n last modified objects from an Amazon S3 bucket prefix using boto3

I need to get a list of object keys under an S3 prefix, sorted by their last modified timestamp. There are a lot of objects, and I am only interested in a certain number of the most recently modified ones. How can I do that in boto3?

Sorting all the objects on the client side, as I currently do, takes a very long time:

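import boto3

def get_last_modified(obj):
    # last_modified is a timezone-aware datetime; timestamp() is portable,
    # whereas strftime("%s") is a platform-specific extension.
    return obj.last_modified.timestamp()

def process(prefix):
    input_bucket = boto3.resource("s3").Bucket("my-test-bucket")
    objects = list(input_bucket.objects.filter(Prefix=prefix))
    sorted_objects = sorted(objects, key=get_last_modified, reverse=True)
    return sorted_objects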

Upvotes: 4

Views: 1827

Answers (1)

DeepLearnMD MD

Reputation: 21

I searched for a server-side filtering option in boto3, but none seems to be available out of the box. All the solutions I found suggest pulling all the files and then working with the result. The approach below depends on the naming convention of your files, but it does the trick.

Here is something that works for me and might be relevant here too. You mention you are interested in recently modified files; this approach applies if the timestamp is part of the file name.

In my case the files are named FOLDER/PREFIX_<TIMESTAMP>.json.zip, with TIMESTAMP being the Unix time at which the file was generated. You can use list_objects_v2 with StartAfter: S3 lists keys in ascending lexicographic order, and StartAfter makes the listing begin after the given key.

So my bucket looks like so:

FOLDER/PREFIX_1662634638.json.zip
FOLDER/PREFIX_1662634774.json.zip
FOLDER/PREFIX_1662634882.json.zip

In my case I can get the files from the last X seconds and then filter client-side if needed.
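To see why this works, here is a small local sketch (no AWS calls) that mimics what StartAfter does on the three example keys. The helper name keys_after is hypothetical and only simulates the server-side behaviour. It also assumes that all timestamps have the same number of digits; otherwise string order and numeric order can diverge.

```python
# Local simulation of StartAfter; keys_after is a hypothetical helper, not a boto3 call.
EXAMPLE_KEYS = [
    "FOLDER/PREFIX_1662634638.json.zip",
    "FOLDER/PREFIX_1662634774.json.zip",
    "FOLDER/PREFIX_1662634882.json.zip",
]


def keys_after(keys, start_after):
    """Return the keys that sort strictly after start_after.

    For ASCII keys, Python's string comparison gives the same order as
    the byte-wise ordering S3 uses when listing.
    """
    return [k for k in sorted(keys) if k > start_after]


# A cutoff between the first and second file keeps only the two newer files.
recent = keys_after(EXAMPLE_KEYS, "FOLDER/PREFIX_1662634700")
print(recent)
```

Because the cutoff string is a prefix-style value rather than a full key, every key whose timestamp is larger than the cutoff sorts after it, regardless of the file extension.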

To get all files from the last hour, do the following (given the file structure in the example above):

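import boto3
import datetime

# ACCESS_KEY and SECRET_KEY are placeholders for your own credentials.
s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
)

# Unix timestamp (in seconds) of one hour ago.
last_hour = datetime.datetime.now() - datetime.timedelta(hours=1)
last_hour_ts = int(last_hour.timestamp())

# Keys are listed in ascending lexicographic order, starting after StartAfter.
response = s3.list_objects_v2(
    Bucket='MY_BUCKET',
    Prefix='FOLDER/PREFIX',
    StartAfter=f'FOLDER/PREFIX_{last_hour_ts}',
)
keys = [obj['Key'] for obj in response.get('Contents', [])]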
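One caveat: a single list_objects_v2 call returns at most 1000 keys. If the time window can contain more than that, a paginator can follow the continuation tokens. The following is a rough sketch under the same naming assumption (keys of the form <prefix>_<epoch seconds>...); the function name list_recent_keys and its parameters are my own, not part of boto3. It also sorts the reduced result by LastModified so you can take the n newest keys.

```python
import datetime


def list_recent_keys(s3_client, bucket, prefix, seconds, n=None):
    """List keys under prefix whose embedded timestamp is newer than `seconds` ago.

    Hypothetical helper: assumes keys look like <prefix>_<epoch seconds>...,
    with all timestamps having the same number of digits.
    """
    cutoff = datetime.datetime.now() - datetime.timedelta(seconds=seconds)
    start_after = f"{prefix}_{int(cutoff.timestamp())}"

    # The paginator issues repeated list_objects_v2 calls until all pages are read.
    paginator = s3_client.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, StartAfter=start_after):
        # "Contents" is absent when a page has no matching keys.
        objects.extend(page.get("Contents", []))

    # Only the reduced result set is sorted client-side, newest first.
    objects.sort(key=lambda obj: obj["LastModified"], reverse=True)
    keys = [obj["Key"] for obj in objects]
    return keys if n is None else keys[:n]


# Example usage (requires boto3 and valid credentials):
# import boto3
# newest = list_recent_keys(boto3.client("s3"), "MY_BUCKET", "FOLDER/PREFIX", 3600, n=10)
```

The client is passed in as an argument so the same function works with however you configure credentials. Note that the window has to be wide enough to contain at least n objects, otherwise you get fewer than n keys back.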

Hope that helps!

Upvotes: 2
