Hazzamataza

Reputation: 993

Efficiently downloading files from S3 periodically using python boto3

I'm trying to download the last 24 hours of new files added to an S3 bucket; however, the bucket contains a large number of files.

From my understanding, S3 buckets use a flat structure where objects are stored alphabetically based on the key name.

I've written a script to pull all of the data stored in the bucket using threading. However, now that I have all the files on my local system, I want to update the database every 24 hours with any new files that have been uploaded to S3.

Most forums recommend using 'last modified' to search for the correct files and then download the files that match the date specified.

Firstly, does downloading a file from the S3 bucket change its 'last modified' timestamp? It seems like this could cause problems.

Secondly, this seems like a really inefficient process - searching through the entire bucket for files with the correct 'last modified' each time, then downloading... especially since the bucket contains a huge number of files. Is there a better way to achieve this?
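To make that concrete, this is roughly the kind of scan I mean - a minimal sketch, with the bucket name and local filenames as placeholders and the threading left out:

    from datetime import datetime, timedelta, timezone

    import boto3

    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Every page of keys has to be listed before the LastModified
    # filter can even be applied.
    paginator = s3.get_paginator("list_objects_v2")
    new_keys = []
    for page in paginator.paginate(Bucket="my-example-bucket"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                new_keys.append(obj["Key"])

    # Flatten the key into a local filename just for illustration.
    for key in new_keys:
        s3.download_file("my-example-bucket", key, key.replace("/", "_"))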

Finally, does the prefix filter make this process any more efficient, or does that also require searching through all the files?

Thanks in advance!

Upvotes: 0

Views: 2655

Answers (2)

Stephen

Reputation: 3727

Another solution to add here..

You could enable S3 Inventory, which gives you a daily report of all the files in the bucket, including metadata such as the last modified date, in CSV format.

When the CSV is generated (the first one can take up to 48 hours) you can build a list of new files and download them accordingly. The DynamoDB/Lambda option mentioned in the other answer will definitely give you a more real-time solution.
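As a rough sketch of the consumption side - assuming a CSV-format inventory with the LastModifiedDate field enabled, the first three columns being Bucket, Key and LastModifiedDate, and the gzipped report already downloaded locally - it could look something like this (adjust to your own inventory configuration):

    import csv
    import gzip
    from datetime import datetime, timedelta, timezone

    import boto3

    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Assumes the gzipped CSV report was already fetched locally and that the
    # first three columns are Bucket, Key, LastModifiedDate; adjust to match
    # your own inventory configuration.
    with gzip.open("report.csv.gz", mode="rt", newline="") as f:
        for bucket, key, last_modified, *rest in csv.reader(f):
            # Timestamp assumed to be ISO 8601, e.g. 2021-01-01T05:00:00.000Z
            modified = datetime.strptime(
                last_modified.rstrip("Z").split(".")[0], "%Y-%m-%dT%H:%M:%S"
            ).replace(tzinfo=timezone.utc)
            if modified >= cutoff:
                s3.download_file(bucket, key, key.replace("/", "_"))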

Also, I think the modified date is only affected by PUT and POST actions.

Upvotes: 1

Caesar Kabalan

Reputation: 791

I'm going to go a different direction with this answer... You're right, that process is inefficient. I'm not sure of the quantities and size of data you're dealing with, but you're basically describing a batch job that downloads new files. Searching a large number of keys is the wrong way to do it and is kind of an anti-pattern in AWS. At the root, you need to keep track of new files as they come in.

The best way to solve this is using a Lambda Function (python since you're already familiar) that is triggered when a new object is deposited in your S3 bucket. What does that function do when a new file comes in?

If I had to solve this I would do one of the following:

  • Add the key of the new file to a DynamoDB table along with the timestamp (a sketch of this is below the list). Throughout the day that table will grow as new files come in. When your batch job runs, read the contents of that table, download all the keys referenced, and remove each row from the DynamoDB table. If you wanted to get fancy you could query based on the timestamp column and never clear rows from the table.
  • Copy the file to a second "pickup" bucket. When your batch job runs, you just read all the files out of this pickup bucket and delete them. You have to be careful with this one. It's really easy, but you have to consider the size/quantity of the files you're depositing so you don't run into Lambda's 5-minute execution limit.
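A rough sketch of what the Lambda for the first option could look like (the table name "new-files" and its "s3_key" partition key are just made up for the example):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("new-files")  # hypothetical table, partition key "s3_key"


    def lambda_handler(event, context):
        # Fired by an S3 "ObjectCreated:*" notification configured on the bucket.
        for record in event["Records"]:
            table.put_item(
                Item={
                    "s3_key": record["s3"]["object"]["key"],
                    "bucket": record["s3"]["bucket"]["name"],
                    "uploaded_at": record["eventTime"],
                }
            )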

I can't really recommend one over the other because I'm not familiar with your scale, cost appetite, etc. For a typical use case I would probably go with the DynamoDB table solution. I think you'll be surprised how easy DynamoDB is to interact with in Python3.
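And for completeness, a sketch of the matching batch job side, reusing the same made-up table and key schema from the Lambda sketch above:

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("new-files")  # same hypothetical table as above

    # A paginated Scan is fine here because the table is emptied every run
    # and never holds more than a day's worth of keys.
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    for item in items:
        s3.download_file(item["bucket"], item["s3_key"], item["s3_key"].replace("/", "_"))
        table.delete_item(Key={"s3_key": item["s3_key"]})

Deleting each row after download keeps the table tiny, which is why the fancier "query by timestamp and never delete" variant is optional.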

Upvotes: 1
