Yeahprettymuch

Reputation: 541

Fetching only last 100 or so files (most recent files) from an S3 bucket

I'm building a process that sends customizable alerts based on the date a file was last received in an S3 bucket.

Because my bucket is huge, doing something like this takes a very long time to run:

import boto3

s3 = boto3.resource('s3', aws_access_key_id='demo', aws_secret_access_key='demo')

my_bucket = s3.Bucket('demo')

# Lazily iterates over every object in the bucket (paginated, up to 1,000 keys per request)
bucket_items = my_bucket.objects.all()

I could of course simply do the above, and then sort by the last_modified attribute, but I wonder whether there's a more elegant way to sift out just the 100 most recent files themselves when the API call is being made.

Ideally, I'd also like to customize this further with search strings, e.g. the 100 most recent files that have ".docx" in the file name, or the most recent files above 1 MB in size.

Just wondering what the best practices are for this kind of querying when the contents of the entire bucket are not needed.

Upvotes: 1

Views: 1024

Answers (2)

John Rotenstein

Reputation: 269282

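The S3 listing API cannot sort by last-modified date or filter by size or substring (only by key prefix), so your available options are: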

  • Retrieve a list of objects from the bucket: this is slow if you have a large number of objects (10,000+), although restricting the listing with a Prefix can make it a lot faster, or
  • Obtain a daily listing via Amazon S3 Inventory: it sounds like you want information more up-to-date than daily, though, or
  • Maintain your own database of objects
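As a rough sketch of the first option: the function below pages through a prefix-restricted listing and keeps only the newest entries with `heapq.nlargest`, so the full listing is never held in memory. The function name `most_recent_objects`, the prefix `incoming/`, and the bucket name are made up for illustration; `client` is assumed to be a boto3 S3 client.

```python
import heapq


def most_recent_objects(client, bucket, prefix="", limit=100):
    """Return the `limit` most recently modified objects under `prefix`.

    `client` is assumed to be a boto3 S3 client (boto3.client("s3")).
    Every key under the prefix is still listed; S3 cannot sort by date.
    """
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    # Empty pages have no "Contents" key, hence the .get() default.
    candidates = (obj for page in pages for obj in page.get("Contents", []))
    return heapq.nlargest(limit, candidates, key=lambda obj: obj["LastModified"])
```

Usage would look something like `most_recent_objects(boto3.client("s3"), "demo", prefix="incoming/")`. This only helps if your keys are laid out so that a prefix (for example a date-based folder) narrows the listing down; otherwise it still walks the whole bucket.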

To maintain your own database of objects:

  • Create an Amazon S3 event that triggers an AWS Lambda function whenever objects are created/updated/deleted
  • The AWS Lambda function should store this information in a database (you would need to write this functionality)
  • You can then query the database for all of your requirements
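A minimal sketch of what the Lambda side of this might look like. It assumes the standard S3 event notification layout (`Records[].eventName`, `Records[].s3.bucket.name`, `Records[].s3.object.key`, `Records[].eventTime`) and uses SQLite purely as a stand-in database so the example is runnable; in a real Lambda you would more likely write to DynamoDB or RDS. The table and function names are hypothetical.

```python
import sqlite3
from urllib.parse import unquote_plus


def create_table(conn):
    # One row per object currently believed to be in the bucket.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "bucket TEXT, object_key TEXT, size INTEGER, event_time TEXT, "
        "PRIMARY KEY (bucket, object_key))"
    )


def apply_s3_event(conn, event):
    """Mirror one S3 event notification into the objects table."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in event notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectRemoved"):
            conn.execute(
                "DELETE FROM objects WHERE bucket = ? AND object_key = ?",
                (bucket, key),
            )
        else:
            conn.execute(
                "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
                (bucket, key, record["s3"]["object"].get("size", 0), record["eventTime"]),
            )
    conn.commit()


def recent_keys(conn, suffix="", limit=100):
    """Newest keys first; ISO 8601 event times sort correctly as text."""
    rows = conn.execute(
        "SELECT object_key FROM objects WHERE object_key LIKE ? "
        "ORDER BY event_time DESC LIMIT ?",
        ("%" + suffix, limit),
    )
    return [row[0] for row in rows]
```

With something like this in place, "the 100 most recent .docx files" becomes a single indexed query (`recent_keys(conn, ".docx")`) instead of a full bucket listing.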

Upvotes: 2

Chuong Nguyen

Reputation: 1162

For the 100 most recent files, you can use list_objects in boto3. Each entry in the response has a 'LastModified' field that you can sort on to pick out the files you need. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects

For filtering, list the objects, apply your conditions in your own code, and then fetch each matching object with something like this:

import boto3

s3 = boto3.resource('s3')
srcbucket = 'bucket'  # bucket name
srckey = 'object'     # key of a matching object
obj = s3.Object(srcbucket, srckey)
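To make the filtering step a bit more concrete, here is a rough sketch that works on the 'Contents' entries of a listing response, each of which carries 'Key', 'Size' and 'LastModified'. The helper name `filter_objects` and its parameters are invented for this example, not part of boto3.

```python
ONE_MB = 1024 * 1024


def filter_objects(contents, name_contains=None, larger_than=None):
    """Filter listing entries client-side and return them newest first.

    `contents` is assumed to be the 'Contents' list of a list_objects /
    list_objects_v2 response.
    """
    matches = []
    for obj in contents:
        if name_contains is not None and name_contains not in obj["Key"]:
            continue
        if larger_than is not None and obj["Size"] <= larger_than:
            continue
        matches.append(obj)
    return sorted(matches, key=lambda obj: obj["LastModified"], reverse=True)
```

For example, `filter_objects(response.get("Contents", []), name_contains=".docx", larger_than=ONE_MB)[:100]`. Keep in mind that a single list_objects call returns at most 1,000 keys, so for a big bucket you would need to paginate before filtering.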

Upvotes: 1
