Reputation: 377
I'm trying to get the last added file in an AWS S3 bucket using a Linux shell script. Can anyone let me know how I can do this?
Upvotes: 1
Views: 4795
Reputation: 18869
The best compromise for a simple command that is performant, at the time of this writing and based on the simplistic performance test below, would be aws s3 ls --recursive (Option #2).

Option #1: s3cmd
(See s3cmd Usage, or explore the man page after installing it using sudo pip install s3cmd)
s3cmd ls s3://the-bucket | sort | tail -n 1

Option #2: aws s3
aws s3 ls the-bucket --recursive --output text | sort | tail -n 1 | awk '{print $1"T"$2","$3","$4}'
(Note that awk in the above refers to GNU awk. See this if you need to install it, as well as any other GNU utilities, on macOS.)

Option #3: aws s3api (with either list-objects or list-objects-v2)
aws s3api list-objects-v2 --bucket the-bucket | jq -r '.[] | max_by(.LastModified) | [.Key, .LastModified, .Size]|@csv'

Note that both of the s3api commands are paginated; handling that pagination natively is a fundamental improvement of list-objects-v2 over list-objects.
If the bucket has more than 1,000 objects (use s3cmd du "s3://the-bucket" | awk '{print $2}' to get the number of objects), then you'll need to handle the pagination of the API and make multiple calls to get back all the results, since the sort order of the returned results is UTF-8 binary order and not 'Last Modified'.
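For illustration, here is a minimal sketch of such a pagination loop (not from the original answer; it assumes bash, the aws CLI, jq, a non-empty bucket, and a hypothetical bucket name). It disables the CLI's own auto-pagination so each API page is visible, and tracks the newest object across pages:

#!/usr/bin/env bash
# Sketch: page through list-objects-v2 manually, keeping the newest object seen.
bucket="the-bucket"   # hypothetical bucket name
token=""
newest=""
while : ; do
  if [ -z "$token" ]; then
    page=$(aws s3api list-objects-v2 --bucket "$bucket" --no-paginate --max-keys 1000)
  else
    page=$(aws s3api list-objects-v2 --bucket "$bucket" --no-paginate --max-keys 1000 \
           --continuation-token "$token")
  fi
  # Newest object on this page, as "LastModified",Size,"Key" CSV.
  candidate=$(printf '%s' "$page" | jq -r '.Contents | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv')
  # ISO-8601 timestamps sort lexicographically, so a plain sort picks the newer line.
  newest=$(printf '%s\n%s\n' "$newest" "$candidate" | sort | tail -n 1)
  token=$(printf '%s' "$page" | jq -r '.NextContinuationToken // empty')
  [ -z "$token" ] && break
done
echo "$newest"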
Here is a simple performance comparison of the above three methods executed for the same bucket. For simplicity, the bucket had less than a 1000 objects. Here is the one-liner to see the execution times:
(The following one-liner uses zsh syntax; >! forces overwriting of output.log.)
export bucket_name="the-bucket" && \
( \
time ( s3cmd ls --recursive "s3://${bucket_name}" | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1 ) & \
time ( aws s3 ls --recursive "${bucket_name}" --output text | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1 ) & \
time ( aws s3api list-objects-v2 --bucket "${bucket_name}" | jq -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) & \
time ( aws s3api list-objects --bucket "${bucket_name}" | jq -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) &
) >! output.log
(output.log will store the last modified objects listed by each command.)
The output of the above is as follows:
( s3cmd ls --recursive ...) 1.10s user 0.10s system 79% cpu 1.512 total
( aws s3 ls --recursive ...) 0.72s user 0.12s system 74% cpu 1.128 total
( aws s3api list-objects-v2 ...) 0.54s user 0.11s system 74% cpu 0.867 total
( aws s3api list-objects ...) 0.57s user 0.11s system 75% cpu 0.900 total
For the same number of objects being returned, the aws s3api calls are appreciably more performant; however, there is the additional (scripting) complexity of dealing with the pagination of the API.
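If your shell is bash rather than zsh, a roughly equivalent sketch of the timing harness follows (an assumption-laden translation, not the original author's command; timing output still goes to the terminal via stderr, while the listed objects go to output.log):

export bucket_name="the-bucket"
{
  time ( s3cmd ls --recursive "s3://${bucket_name}" | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1 ) &
  time ( aws s3 ls --recursive "${bucket_name}" --output text | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1 ) &
  time ( aws s3api list-objects-v2 --bucket "${bucket_name}" | jq -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) &
  time ( aws s3api list-objects --bucket "${bucket_name}" | jq -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) &
  wait   # block until all four backgrounded timings have finished
} > output.log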
Useful link(s):
See Leveraging s3 and s3api to understand the difference between aws s3 and aws s3api.
Upvotes: 1
Reputation: 317
aws s3 ls s3://your-bucket --recursive | sort | tail -n 1
This command will recursively check all files in all folders and subfolders of an S3 bucket, and return the name of the file most recently modified as well as the timestamp of that modification.
(Note: awscli should be installed first and configured with your AWS account info. See https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-configure-cli.html.)
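If you want just the object key without the date, time, and size columns, here is a small follow-up sketch (assuming the default aws s3 ls column layout; works with GNU or BSD awk):

aws s3 ls s3://your-bucket --recursive | sort | tail -n 1 | awk '{$1=$2=$3=""; sub(/^ +/, ""); print}'

Clearing the first three fields (rather than printing $4) keeps keys that contain spaces, though awk will collapse any run of consecutive spaces inside the key to a single space.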
Upvotes: 0
Reputation: 52393
One way is to use the output of s3cmd and sort the output to get the last added file.
s3cmd ls s3://{{bucket}} | sort | tail -n 1 | awk '{print $4}'
sort - sorts the output by creation time
tail -n 1 - returns the last file
awk '{print $4}' - prints the file path (the fourth column of the s3cmd listing)
Upvotes: 1
Reputation: 9411
That's not possible directly: S3 is not a database or a filesystem, so it does not index objects by creation time.
However, with S3 queries you can request a list of objects that were created or modified after a certain date:
aws s3api list-objects --bucket "YOURBUCKET" --query 'Contents[?LastModified>=`2016-12-27`][].{Key: Key}'
And if you want only added objects, not modified ones, you'll have to create a custom metadata attribute, save it with the object, and query based on that custom attribute.
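A hedged sketch of that idea (the file name, bucket, and "created-at" metadata key below are illustrative, not from the original answer):

# Record a creation timestamp as custom metadata at upload time...
aws s3 cp file.txt "s3://YOURBUCKET/file.txt" \
    --metadata created-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# ...and read it back per object (custom metadata appears under "Metadata").
aws s3api head-object --bucket "YOURBUCKET" --key file.txt \
    --query 'Metadata."created-at"'

Note that S3 cannot filter listings by custom metadata server-side, so querying on that attribute means issuing a head-object call per object (or maintaining an external index).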
Upvotes: 0