Reputation: 459
I am using Python in an AWS Lambda function to list the keys in an S3 bucket that begin with a specific ID:
for object in mybucket.objects.all():
    file_name = os.path.basename(object.key)
    match_id = file_name.split('_', 1)[0]
The problem is that if the S3 bucket has several thousand files, the iteration is very inefficient and the Lambda function sometimes times out.
Here is an example file name:
https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg
I want to iterate only over objects that contain "012345" in the key name. Any good suggestions on how I can accomplish that?
Upvotes: 4
Views: 6946
Reputation: 13035
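Here is how you can solve it.
S3 stores everything as objects; there are no real folders or file names. What looks like a folder path is just part of the object key, displayed that way for user convenience. That means you can list objects by key prefix: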
aws s3 ls s3://bucket/folder1/folder2/filenamepart --recursive
will list all S3 objects whose keys start with that prefix. The same prefix filter is available in boto3:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
for obj in my_bucket.objects.filter(Prefix='012345'):
    print(obj)
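The same prefix filter can also be applied through the low-level client with a paginator, which fetches the listing page by page. The following is only a minimal sketch: the helper name `iter_keys_with_prefix` and its `client` parameter are made up for illustration, and the bucket name and prefix are the placeholders used above. Note that `Prefix` matches from the start of the full key, so if the files live under a folder-like path, that path has to be part of the prefix.

```python
def iter_keys_with_prefix(client, bucket, prefix):
    """Yield every key in `bucket` that starts with `prefix`.

    `client` is assumed to be a boto3 S3 client (hypothetical helper,
    not part of boto3 itself).
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for item in page.get("Contents", []):
            yield item["Key"]


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   for key in iter_keys_with_prefix(boto3.client("s3"), "bucketname", "012345"):
#       print(key)
```

Because the filtering happens on the S3 side, only the matching keys are returned to the Lambda function instead of the whole bucket listing.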
To speed up the listing, you can run multiple scripts in parallel.
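One way to sketch the parallel idea is with threads inside a single process rather than separate scripts, with each worker listing a different prefix. This is an assumption-laden illustration, not a drop-in solution: `list_prefixes_in_parallel` and its `list_keys` argument are hypothetical names, and `list_keys` stands for whatever function returns the keys for one prefix (for example, a wrapper around the `objects.filter(Prefix=...)` call above).

```python
from concurrent.futures import ThreadPoolExecutor


def list_prefixes_in_parallel(list_keys, prefixes, max_workers=8):
    """Call list_keys(prefix) for each prefix concurrently.

    Returns a dict mapping each prefix to the list of keys found for it.
    `list_keys` is a caller-supplied function (hypothetical here).
    """
    prefixes = list(prefixes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with prefixes.
        results = pool.map(lambda p: list(list_keys(p)), prefixes)
        return dict(zip(prefixes, results))
```

Splitting the work this way only helps if the keys can be divided into several non-overlapping prefixes; for a single prefix such as "012345" there is nothing to parallelize.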
Hope it helps.
Upvotes: 1
Reputation: 21274
You can improve speed by 30-40% by dropping os and using string methods.
Depending on the assumptions you can make about the file path string, you can get additional speedups:
Using os.path.basename():
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
os.path.basename(fname).split("_")[0] == match
# 1.03 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without os, splitting first on / and then on _:
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("/")[-1].split("_")[0] == match
# 657 ns ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you know that underscores occur only in the actual file name, and that the ID is always six characters long, you can use just one split():
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("_")[0][-6:] == match
# 388 ns ± 5.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
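The %%timeit cells above only run in IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain Python script with the standard-library timeit module. The function names below are made up for illustration, and the timings will vary by machine, so none are shown.

```python
import os
import timeit

MATCH = "012345"
FNAME = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"


def with_basename(fname=FNAME, match=MATCH):
    return os.path.basename(fname).split("_")[0] == match


def with_two_splits(fname=FNAME, match=MATCH):
    return fname.split("/")[-1].split("_")[0] == match


def with_one_split(fname=FNAME, match=MATCH):
    # Assumes no underscore before the file name and a six-character ID.
    return fname.split("_")[0][-6:] == match


def compare(number=100_000):
    """Return the average seconds per call for each approach."""
    funcs = (with_basename, with_two_splits, with_one_split)
    return {f.__name__: timeit.timeit(f, number=number) / number for f in funcs}


# Usage: print(compare())
```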
Upvotes: 0