codeexplorer

Reputation: 459

AWS S3: list keys that begin with a string

I am using Python in an AWS Lambda function to list the keys in an S3 bucket that begin with a specific ID:

for obj in mybucket.objects.all():
    file_name = os.path.basename(obj.key)
    match_id = file_name.split('_', 1)[0]

The problem is that if the bucket has several thousand files, the iteration is very inefficient and the Lambda function sometimes times out.

Here is an example file name

https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg

I want to iterate only over objects whose key name contains "012345". Any suggestions on how I can accomplish that?

Upvotes: 4

Views: 6946

Answers (2)

Kannaiyan

Reputation: 13035

Here is how you need to solve it.

S3 stores everything as objects; there are no real folders or file names. The folder-like view is only a convenience for the user, and each object is identified by its full key.

aws s3 ls s3://bucket/folder1/folder2/filenamepart --recursive

lists every S3 object whose key starts with that prefix.

import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
for obj in my_bucket.objects.filter(Prefix='012345'):
    print(obj.key)
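
The Prefix filter is applied by S3 itself, so only matching keys are returned instead of the whole bucket. The same idea can be sketched with the lower-level client and a paginator. The helper name keys_with_prefix and the bucket name are placeholders that do not appear in the original post, and the client is passed in as an argument so the helper itself has no hard dependency on boto3.

```python
def keys_with_prefix(client, bucket, prefix):
    """Yield every key in `bucket` that starts with `prefix`.

    `client` is expected to behave like boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for item in page.get("Contents", []):
            yield item["Key"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for key in keys_with_prefix(boto3.client("s3"), "bucketname", "012345"):
#       print(key)
```

Note that the prefix is matched from the start of the full key. If the objects live under a folder-like path, the prefix has to include it, for example 'folder1/012345'.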

To speed up the listing, you can run multiple scripts in parallel.
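
One way to read the parallel suggestion, when several IDs have to be listed, is to run one prefix listing per ID on a thread pool inside a single script. This is only a sketch: list_prefixes_in_parallel and the list_keys callable are hypothetical names, and list_keys stands for whatever function returns the keys for one prefix (for example a wrapper around a boto3 listing call).

```python
from concurrent.futures import ThreadPoolExecutor


def list_prefixes_in_parallel(list_keys, prefixes, max_workers=8):
    """Run `list_keys(prefix)` for each prefix on a thread pool.

    `list_keys` is any callable that returns an iterable of keys for one
    prefix (hypothetical; supply your own S3 listing function).
    Returns a dict mapping each prefix to the list of keys found.
    """
    prefixes = list(prefixes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with prefixes.
        results = pool.map(lambda p: list(list_keys(p)), prefixes)
        return dict(zip(prefixes, results))
```

Threads suit this case because each listing spends most of its time waiting on the network. For a single ID, the Prefix filter alone is usually the bigger win.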

Hope it helps.

Upvotes: 1

andrew_reece

Reputation: 21274

You can make the per-key string check 30-40% faster by dropping os and using string methods.
Depending on the assumptions you can make about the file path string, you can get additional speedups:

Using os.path.basename():

%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
os.path.basename(fname).split("_")[0] == match

# 1.03 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Without os, splitting first on / and then on _:

%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("/")[-1].split("_")[0] == match

# 657 ns ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

If you know that the only underscores occur in the actual file name and that the ID is always six characters long, you can use just one split():

%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("_")[0][-6:] == match

# 388 ns ± 5.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
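
The string-method approach can be wrapped in a small helper that makes fewer assumptions about the path. This is a sketch, not a benchmarked replacement: key_has_id is a hypothetical name, and it assumes the ID is always followed by an underscore, as in the example file name.

```python
def key_has_id(key, match_id):
    """Return True if the last path segment of `key` starts with `match_id` + "_"."""
    # rpartition returns ("", "", key) when there is no "/", so this also
    # works for keys that have no folder part.
    file_name = key.rpartition("/")[2]
    return file_name.startswith(match_id + "_")
```

Unlike the [-6:] slice, this does not depend on the length of the ID or on underscores being absent from the rest of the path.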

Upvotes: 0
