Reputation: 33
We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done with an AWS CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of JSON files in S3 is getting pretty large, since more are added every day. It's nowhere close to any capacity limit of the S3 bucket because the files are so small. However, in practical terms there's no need to copy ALL of these JSON files. Realistically the system would be fine copying only the most recent 100 or so, but we do want to keep the older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
Upvotes: 0
Views: 1171
Reputation: 1512
If you don't want to write a script and there is a pattern (like A*.csv) in the files you want to copy, you can filter on that instead. I know the question is about copying a certain number of files, but sometimes what you actually want is an arbitrary subset of files that share a common pattern in the filename (S3 object key), e.g. to copy for testing.
The command below was very useful for me from the CLI (though it copies files by wildcard pattern rather than by count):
aws s3 cp s3://noaa-gsod-pds/2022/ s3://<target_bucket_name>/2022/ --recursive --exclude '*' --include 'A*.csv'
If you do want to write a script, the command below gets you 10 object keys from the S3 bucket, and you can script the copy action on top of it:
aws s3api list-objects --max-items 10 --bucket noaa-gsod-pds | jq '.Contents' | jq '.[] | .Key'
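If you want the "most recent N" behaviour the question asks for, here is a rough shell sketch building on that idea; the bucket name, profile, count and local path are placeholders, and it assumes object keys contain no whitespace. It uses a JMESPath --query to sort the listing by LastModified and keep the newest 100 keys:

# list every object, sort by LastModified, keep the 100 newest keys
recent_keys=$(aws s3api list-objects-v2 \
  --bucket my-bucket \
  --profile dev-profile \
  --query 'sort_by(Contents, &LastModified)[-100:].Key' \
  --output text)

# copy each of those keys down to the local path
for key in $recent_keys; do
  aws s3 cp "s3://my-bucket/$key" ~/some/local/path/ --profile dev-profile
done

Note that this still lists every object in the bucket (the sorting happens client-side in the CLI), so the listing gets slower as the bucket grows, but the copy itself stays at 100 files.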
Upvotes: 0
Reputation: 269400
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are new or modified since the last sync. However, this means the destination will need to retain a copy of the 'old' files so that they are not copied again.
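For the setup in the question, that would look something like this (paths and profile copied from the question's own command):

aws s3 sync s3://bucket-path ~/some/local/path/ --profile dev-profile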
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
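A rough sketch of that idea using the CLI (a Python/boto3 version would follow the same structure); the bucket name, profile, paths and the timestamp file are placeholders:

# timestamp of the previous run (ISO 8601); default to the epoch on first run
last_run=$(cat ~/.last_s3_copy 2>/dev/null || echo "1970-01-01T00:00:00")

# list only keys whose LastModified is after that timestamp
new_keys=$(aws s3api list-objects-v2 \
  --bucket my-bucket \
  --profile dev-profile \
  --query "Contents[?LastModified>'$last_run'].Key" \
  --output text)

# copy each new key, then record the current time for the next run
for key in $new_keys; do
  aws s3 cp "s3://my-bucket/$key" ~/some/local/path/ --profile dev-profile
done
date -u +"%Y-%m-%dT%H:%M:%S" > ~/.last_s3_copy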
Upvotes: 1