Reputation: 33
We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done with an AWS CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of JSON files in S3 is getting pretty large, since more are added every day. It's nowhere close to any capacity limit of the S3 bucket because the files are so small. However, in practical terms there's no need to copy ALL of these JSON files. Realistically the system would be fine copying only the most recent 100 or so, but we do want to keep the older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
Upvotes: 0
Views: 1171
Reputation: 1512
If you don't want to write a script and there is a pattern (like A*.csv) in the files you want to copy, you can filter on that instead. I know the question is about copying a certain number of files, but sometimes what you actually want is an arbitrary subset of files that share a common pattern in the filename (S3 object key), e.g. to copy for testing.
The command below was very useful for me from the CLI (though it copies files by wildcard pattern rather than by count):
aws s3 cp s3://noaa-gsod-pds/2022/ s3://<target_bucket_name>/2022/ --recursive --exclude '*' --include 'A*.csv'
If you do want to write a script, the command below gets you 10 object keys from the S3 bucket, and you can script the copy action on top of it:
aws s3api list-objects --max-items 10 --bucket noaa-gsod-pds | jq '.Contents' | jq '.[] | .Key'
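If you want the "most recent N" behaviour the question asks for, here is a rough shell sketch building on that idea; the bucket name, profile, count and local path are placeholders, and it assumes object keys contain no whitespace. It uses a JMESPath --query to sort the listing by LastModified and keep the newest 100 keys:

# list every object, sort by LastModified, keep the 100 newest keys
recent_keys=$(aws s3api list-objects-v2 \
  --bucket my-bucket \
  --profile dev-profile \
  --query 'sort_by(Contents, &LastModified)[-100:].Key' \
  --output text)

# copy each of those keys down to the local path
for key in $recent_keys; do
  aws s3 cp "s3://my-bucket/$key" ~/some/local/path/ --profile dev-profile
done

Note that this still lists every object in the bucket (the sorting happens client-side in the CLI), so the listing gets slower as the bucket grows, but the copy itself stays at 100 files.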
Upvotes: 0
Reputation: 269400
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are new or modified since the last sync. However, this means the destination will need to retain a copy of the 'old' files so that they are not copied again.
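For the setup in the question, that would look something like this (paths and profile copied from the question's own command):

aws s3 sync s3://bucket-path ~/some/local/path/ --profile dev-profile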
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
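A rough sketch of that idea using the CLI (a Python/boto3 version would follow the same structure); the bucket name, profile, paths and the timestamp file are placeholders:

# timestamp of the previous run (ISO 8601); default to the epoch on first run
last_run=$(cat ~/.last_s3_copy 2>/dev/null || echo "1970-01-01T00:00:00")

# list only keys whose LastModified is after that timestamp
new_keys=$(aws s3api list-objects-v2 \
  --bucket my-bucket \
  --profile dev-profile \
  --query "Contents[?LastModified>'$last_run'].Key" \
  --output text)

# copy each new key, then record the current time for the next run
for key in $new_keys; do
  aws s3 cp "s3://my-bucket/$key" ~/some/local/path/ --profile dev-profile
done
date -u +"%Y-%m-%dT%H:%M:%S" > ~/.last_s3_copy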
Upvotes: 1