Rusty75

Reputation: 527

Cheapest way in aws to move/obtain a subset of files based on date

The main question I have is:
How can I move files based on a date range, without incurring client-side API calls that cost money?

Background:
I want to download a subset of files from an AWS S3 bucket onto a Linux server, but there are millions of them in ONE folder, with nothing differentiating them except a sequence number, and I need a subset of these based on creation date. (Actually, the files contain an event timestamp inside, so I want to reduce the bulk first by creation date.)

I frankly have no idea what costs I am incurring every time I run ls on that dataset, e.g. for testing.

Right now I am considering:

aws s3api list-objects --bucket "${S3_BUCKET}" --prefix "${path_from}" --query "Contents[?LastModified>='${low_extract_date}'].{Key: Key}"

but that filter is applied client-side if I understand correctly. So I would like to first move the relevant files to a different folder, based on creation date.

Then just run aws s3 ls on that set.

Is that possible?

Because in that case, I would either:

  1. move the files to another folder, limiting to the date range I am interested in (2-5% of the set)
  2. list these files (as I understand it, this is where the costs arise?) and subsequently extract them (and move them to archive)
  3. remove the subset folder
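The first option can be sketched as a small script; the bucket name, prefix, and date below are made-up placeholders, and the aws calls are left commented out since they would hit a live account. Note that the --query filter is evaluated client-side by the CLI after the LIST pages come back, so the number of billed LIST requests is the same either way; the mv step, however, is a server-side COPY + DELETE, so no object data passes through your machine.

```shell
#!/usr/bin/env bash
# Sketch of option 1; bucket, prefix, and date are hypothetical placeholders.
S3_BUCKET="my-bucket"
path_from="incoming/"
low_extract_date="2020-07-01"

# Build the JMESPath filter (evaluated client-side by the CLI).
query="Contents[?LastModified>='${low_extract_date}'].Key"
echo "$query"

# 1. List matching keys into a file (billed per 1,000-key LIST page):
# aws s3api list-objects-v2 --bucket "$S3_BUCKET" --prefix "$path_from" \
#     --query "$query" --output text | tr '\t' '\n' > keys.txt

# 2. Move each key into a subset prefix (server-side COPY + DELETE):
# while read -r key; do
#     aws s3 mv "s3://${S3_BUCKET}/${key}" "s3://${S3_BUCKET}/subset/${key##*/}"
# done < keys.txt

# 3. After extraction, remove the subset prefix:
# aws s3 rm "s3://${S3_BUCKET}/subset/" --recursive
```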

or:

  1. sync bucket into a new bucket
  2. delete all files I don't need from that bucket (older than date X)
  3. run ls on the remaining set
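The second option could look like the sketch below (bucket names and cutoff date are placeholders, aws calls commented out). Be aware that sync copies every object once, so you pay a COPY request for 100% of the objects, not just the 2-5% you want, which usually makes this more expensive than filtering a single listing.

```shell
#!/usr/bin/env bash
# Sketch of option 2; all names and dates are hypothetical placeholders.
SRC_BUCKET="source-bucket"
TMP_BUCKET="scratch-bucket"
CUTOFF="2020-07-01"

# JMESPath filter for the objects to delete (older than the cutoff).
drop_filter="Contents[?LastModified<'${CUTOFF}'].Key"
echo "$drop_filter"

# 1. Copy everything into the scratch bucket (one COPY per object):
# aws s3 sync "s3://${SRC_BUCKET}" "s3://${TMP_BUCKET}"

# 2. Delete the objects you don't need from the scratch bucket:
# aws s3api list-objects-v2 --bucket "$TMP_BUCKET" \
#     --query "$drop_filter" --output text | tr '\t' '\n' \
#     | while read -r key; do aws s3 rm "s3://${TMP_BUCKET}/${key}"; done

# 3. List what remains:
# aws s3 ls "s3://${TMP_BUCKET}" --recursive
```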

or: some other way?

And: is that cheaper than listing the files using the query?

thanks!

PS So to clarify: I wish to do a server-side operation to reduce the set initially, and then list the result.

Upvotes: 1

Views: 1169

Answers (1)

Chris Williams

Reputation: 35188

I believe a good approach to this would be the following:

  • If your instance is in a VPC, create a VPC endpoint for S3 to allow a direct private connection to Amazon S3 rather than going across the internet.
  • Move the object keys that you want so they include a date prefix (preferably Y/m/d), e.g. prefix/randomfile.txt might become 2020/07/04/randomfile.txt. If you're planning on scrapping the rest of the files, then move them to a new bucket rather than within the same bucket.
  • Get objects based on the prefix (for all files for this month, the prefix would be 2020/07).
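One way to derive the Y/m/d prefix from each object's LastModified timestamp is sketched below; the key and timestamp are made-up examples, and the actual mv call is commented out.

```shell
#!/usr/bin/env bash
# Derive a Y/m/d destination key from an object's LastModified timestamp.
# The key and timestamp here are hypothetical examples.
src_key="prefix/randomfile.txt"
last_modified="2020-07-04T12:34:56.000Z"

# Slice the ISO-8601 timestamp into year/month/day path components.
day_prefix="${last_modified:0:4}/${last_modified:5:2}/${last_modified:8:2}"
dst_key="${day_prefix}/${src_key##*/}"
echo "$dst_key"    # 2020/07/04/randomfile.txt

# Server-side move into the date-partitioned layout:
# aws s3 mv "s3://bucketname/${src_key}" "s3://bucketname/${dst_key}"
```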

From the CLI you can move a file using the following syntax:

aws s3 mv s3://bucketname/prefix/randomfile.txt s3://bucketname/2020/07/04/randomfile.txt

To copy all the files under a specific prefix you could run the following on the CLI (--recursive is required when the source is a prefix rather than a single object):

aws s3 cp s3://bucketname/2020/07 . --recursive

To list the objects for a specific date you can run the below; using a JMESPath raw string literal (single quotes) inside a double-quoted argument lets $DATE expand in the shell:

aws s3api list-objects-v2 --bucket bucketname --query "Contents[?contains(LastModified, '$DATE')]"

The resulting list of keys would then need to be fed back into further CLI commands (e.g. aws s3 mv) to act on those objects.

Upvotes: 1
