Reputation: 993
I'm trying to download the last 24 hours of new files added to an S3 bucket; however, the bucket contains a very large number of files.
From my understanding, S3 buckets use a flat structure where objects are listed alphabetically by key name.
I've written a script that uses threading to pull all the data stored in the bucket. Now that I have all the files on my local system, I want to update the database every 24 hours with any new files that have been uploaded to S3.
Most forums recommend using 'last modified' to search for the correct files and then downloading the ones that match the date specified.
Firstly, does downloading a file from the S3 bucket change its 'last modified' timestamp? It seems like this could cause problems.
Secondly, this seems like a really inefficient process: searching through the entire bucket for files with the correct 'last modified' each time, then downloading them... especially since the bucket contains a huge number of files. Is there a better way to achieve this?
Finally, does the prefix filter make this process any more efficient, or does it also require searching through all the files?
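For reference, this is roughly the approach I'd be implementing, as a minimal boto3 sketch (the bucket name, prefix, and local filenames are placeholders):

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

# Page through the bucket (optionally narrowed by Prefix) and keep
# only the objects modified in the last 24 hours.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="uploads/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:  # LastModified is timezone-aware
            s3.download_file("my-bucket", obj["Key"], obj["Key"].replace("/", "_"))
```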
Thanks in advance!
Upvotes: 0
Views: 2655
Reputation: 3727
Another solution to add here...
You could enable S3 Inventory, which gives you a daily report of all the objects in the bucket, including metadata such as the last-modified date, in CSV format.
Once the CSV is generated (the first one can take up to 48 hours) you can build a list of new files and download them accordingly. The DynamoDB/Lambda option mentioned in the other answer will definitely give you a more real-time solution.
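If you want to set this up from code rather than the console, a minimal sketch might look like this (bucket names and the configuration Id are assumptions; the destination bucket also needs a policy allowing S3 to write reports to it):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical source/destination buckets -- substitute your own.
s3.put_bucket_inventory_configuration(
    Bucket="my-source-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Include last-modified so the report can drive the daily diff.
        "OptionalFields": ["LastModifiedDate", "Size"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-bucket",
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    },
)
```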
Also, I think the modified date is only affected by PUT and POST actions.
Upvotes: 1
Reputation: 791
I'm going to go a different direction with this answer... You're right, that process is inefficient. I'm not sure of the quantities and sizes of data you're dealing with, but you're essentially describing a batch job that downloads new files. Searching through a large number of keys is the wrong way to do it and is something of an anti-pattern in AWS. At the root, you need to keep track of new files as they come in.
The best way to solve this is with a Lambda function (in Python, since you're already familiar with it) that is triggered whenever a new object is deposited in your S3 bucket. What does that function do when a new file comes in?
If I had to solve this I would do one of the following:
I can't really recommend one over the other because I'm not familiar with your scale, cost appetite, etc. For a typical use case I would probably go with the DynamoDB table solution: the Lambda records each new key in a table, and your daily job reads the table instead of listing the whole bucket. I think you'll be surprised how easy DynamoDB is to interact with in Python 3.
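As a rough sketch of that option (the table name, key schema, and event wiring are assumptions; you'd configure the bucket's s3:ObjectCreated:* notification to invoke the function):

```python
import boto3
from urllib.parse import unquote_plus

# Hypothetical table keyed on "object_key" -- adjust to your schema.
table = boto3.resource("dynamodb").Table("new-s3-objects")

def lambda_handler(event, context):
    # An S3 notification event can carry multiple records.
    for record in event["Records"]:
        obj = record["s3"]["object"]
        table.put_item(Item={
            # Keys in the event payload are URL-encoded.
            "object_key": unquote_plus(obj["key"]),
            "uploaded_at": record["eventTime"],  # ISO 8601 timestamp
        })
```

Your daily job then just queries the table for keys recorded in the last 24 hours and downloads only those objects, instead of paging through the whole bucket.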
Upvotes: 1