Hazzamataza

Reputation: 993

Efficiently downloading files from S3 periodically using python boto3

I'm trying to download the last 24 hours of new files added to an S3 bucket; however, the bucket contains a large number of files.

From my understanding, S3 buckets use a flat structure where objects are stored alphabetically based on the key name.

I've written a script to pull all of the data stored in the bucket using threading. However, now that I have all the files on my local system, I want to update the database every 24 hours with any new files that have been uploaded to S3.

Most forums recommend using 'last modified' to search for the correct files and then download the files that match the date specified.

Firstly, does downloading a file from the S3 bucket change its 'last modified' timestamp? It seems like this could cause problems.

Secondly, this seems like a really inefficient process - searching through the entire bucket for files with the correct 'last modified' each time, then downloading... especially since the bucket contains a huge number of files. Is there a better way to achieve this?
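To make that concrete, this is roughly the kind of scan I mean - a minimal sketch, with the bucket name and local filenames as placeholders and the threading left out:

    from datetime import datetime, timedelta, timezone

    import boto3

    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Every page of keys has to be listed before the LastModified
    # filter can even be applied.
    paginator = s3.get_paginator("list_objects_v2")
    new_keys = []
    for page in paginator.paginate(Bucket="my-example-bucket"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                new_keys.append(obj["Key"])

    # Flatten the key into a local filename just for illustration.
    for key in new_keys:
        s3.download_file("my-example-bucket", key, key.replace("/", "_"))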

Finally, does the prefix filter make this process any more efficient, or does that also require searching through all the files?

Thanks in advance!

Upvotes: 0

Views: 2655

Answers (2)

Stephen

Reputation: 3727

Another solution to add here..

You could enable S3 Inventory, which gives you a daily report of all the files in the bucket, including metadata such as the last modified date, in CSV format.

When the CSV is generated (the first one can take up to 48 hours) you can build a list of new files and download them accordingly. The DynamoDB/Lambda option mentioned in the other answer will definitely give you a more real-time solution.
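As a rough sketch of the consumption side - assuming a CSV-format inventory with the LastModifiedDate field enabled, the first three columns being Bucket, Key and LastModifiedDate, and the gzipped report already downloaded locally - it could look something like this (adjust to your own inventory configuration):

    import csv
    import gzip
    from datetime import datetime, timedelta, timezone

    import boto3

    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Assumes the gzipped CSV report was already fetched locally and that the
    # first three columns are Bucket, Key, LastModifiedDate; adjust to match
    # your own inventory configuration.
    with gzip.open("report.csv.gz", mode="rt", newline="") as f:
        for bucket, key, last_modified, *rest in csv.reader(f):
            # Timestamp assumed to be ISO 8601, e.g. 2021-01-01T05:00:00.000Z
            modified = datetime.strptime(
                last_modified.rstrip("Z").split(".")[0], "%Y-%m-%dT%H:%M:%S"
            ).replace(tzinfo=timezone.utc)
            if modified >= cutoff:
                s3.download_file(bucket, key, key.replace("/", "_"))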

Also, I think the modified date is only affected by PUT and POST actions.

Upvotes: 1

Caesar Kabalan

Reputation: 791

I'm going to go a different direction with this answer... You're right, that process is inefficient. I'm not sure of the quantities and size of data you're dealing with, but you're basically describing a batch job that downloads new files. Searching a large number of keys is the wrong way to do it and is kind of an anti-pattern in AWS. At the root, you need to keep track of new files as they come in.

The best way to solve this is using a Lambda Function (python since you're already familiar) that is triggered when a new object is deposited in your S3 bucket. What does that function do when a new file comes in?

If I had to solve this I would do one of the following:

  • Add the key of the new file to a DynamoDB table along with the timestamp (a sketch of this is below the list). Throughout the day that table will grow as new files come in. When your batch job runs, read the contents of that table, download all the keys referenced, and remove each row from the DynamoDB table. If you wanted to get fancy you could query based on the timestamp column and never clear rows from the table.
  • Copy the file to a second "pickup" bucket. When your batch job runs, you just read all the files out of this pickup bucket and delete them. You have to be careful with this one. It's really easy, but you have to consider the size/quantity of the files you're depositing so you don't run into Lambda's 5-minute execution limit.
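A rough sketch of what the Lambda for the first option could look like (the table name "new-files" and its "s3_key" partition key are just made up for the example):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("new-files")  # hypothetical table, partition key "s3_key"


    def lambda_handler(event, context):
        # Fired by an S3 "ObjectCreated:*" notification configured on the bucket.
        for record in event["Records"]:
            table.put_item(
                Item={
                    "s3_key": record["s3"]["object"]["key"],
                    "bucket": record["s3"]["bucket"]["name"],
                    "uploaded_at": record["eventTime"],
                }
            )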

I can't really recommend one over the other because I'm not familiar with your scale, cost appetite, etc. For a typical use case I would probably go with the DynamoDB table solution. I think you'll be surprised how easy DynamoDB is to interact with in Python3.
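And for completeness, a sketch of the matching batch job side, reusing the same made-up table and key schema from the Lambda sketch above:

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("new-files")  # same hypothetical table as above

    # A paginated Scan is fine here because the table is emptied every run
    # and never holds more than a day's worth of keys.
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    for item in items:
        s3.download_file(item["bucket"], item["s3_key"], item["s3_key"].replace("/", "_"))
        table.delete_item(Key={"s3_key": item["s3_key"]})

Deleting each row after download keeps the table tiny, which is why the fancier "query by timestamp and never delete" variant is optional.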

Upvotes: 1
