Reputation: 2599
I'm working on a Lambda function for which I need a list of all the folders in an S3 bucket. I need to be able to traverse each folder and get all of its subfolders until the end of the tree is reached.
I implemented this by calling the list_objects_v2 function in boto3 recursively with different prefixes, and while it does work, it is very slow; for buckets with a lot of folders the Lambda exceeds the 15-minute timeout.
I wanted to know if there is a more efficient way of doing this.
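For context, here is a minimal sketch of the kind of recursive call I'm describing (the bucket name is a placeholder, and pagination within each level is omitted):
import boto3

s3 = boto3.client('s3')

def list_folders(bucket, prefix=''):
    # List the immediate "subfolders" under this prefix, then recurse into each one
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for common_prefix in response.get('CommonPrefixes', []):
        folder = common_prefix['Prefix']
        print(folder)
        list_folders(bucket, folder)

list_folders('my-bucket')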
Update: Sample output. This is what I'm getting right now by calling list_objects_v2 recursively:
L1/
L1/hist/
L1/hist/2022-01-03
L1/hist/2022-01-01
...
Update 2: Even after using a paginator as mentioned in the answers below, some buckets have so many objects that the Lambda is still exceeding the 15-minute timeout. I'm not sure how to tackle this; please help!
Upvotes: 0
Views: 3195
Reputation: 10828
You can enumerate all of the objects in the bucket, find each object's "folder" (really the prefix up to the last delimiter), and build up a list of the available folders:
import boto3

seen = set()
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket='bucket-name'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        folder = key[:key.rindex("/")] if '/' in key else ""
        if folder not in seen:
            seen.add(folder)
            print(folder)
Alternatively, you could use the same basic recursive logic that you were using, only with multiple workers, which lets you overlap some of the time spent waiting for replies:
import boto3

def recursion_worker(bucket_name, prefix):
    # Look in the bucket at the given prefix, and return a list of folders
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    folders = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/'):
        for sub_prefix in page.get('CommonPrefixes', []):
            folders.append(sub_prefix['Prefix'])
        # S3 returns common prefixes first; if there are any objects, it
        # means further pages will just return objects, so no need
        # to keep going.
        # Note that this isn't a guaranteed order, though it's unlikely
        # to change. To be 100% safe, don't do this check, at the expense
        # of a slower result set for large buckets.
        if len(page.get('Contents', [])) > 0:
            break
    return folders
def folders_via_recursion():
    import multiprocessing
    from collections import deque

    bucket_name = 'example-bucket'
    folders = []
    # Spin up many workers, more than we have CPU cores for,
    # since most of the workers will be spending most
    # of their time waiting for network traffic
    with multiprocessing.Pool(processes=32) as pool:
        pending = deque()
        # Seed the workers with the first request to list objects at
        # the root of the bucket
        pending.append(pool.apply_async(recursion_worker, (bucket_name, "")))
        while len(pending) > 0:
            # Keep going while there are items to parse
            temp = pending.popleft()
            for cur in temp.get():
                # Print out every folder, and store the result in an array
                # to consume later on
                print(cur)
                folders.append(cur)
                # And tell a free worker to list out the folder we just found
                pending.append(pool.apply_async(recursion_worker, (bucket_name, cur)))
    # All done, we can consume the folders array as needed, though
    # do note it's in a somewhat random order; run something like
    #     folders.sort()
    # first if you want a stable order before looking at it
If even this fails, then your bucket is simply too big to parse in one go in a Lambda. You'll need to consider some other way to get a list of the objects, such as using Amazon S3 Inventory to create a listing outside of this Lambda that it can then process.
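As a rough illustration of that route, here is a minimal sketch that derives the folder list from an inventory report, assuming a CSV-format inventory where the second column is the (URL-encoded) object key and the report file has already been downloaded and decompressed locally (the file name is a placeholder):
import csv
from urllib.parse import unquote

folders = set()
# 'inventory.csv' is a placeholder for a downloaded, decompressed inventory report.
# In the default CSV format, column 0 is the bucket name and column 1 is the
# URL-encoded object key.
with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        key = unquote(row[1])
        if '/' in key:
            folders.add(key[:key.rindex('/')])

for folder in sorted(folders):
    print(folder)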
Upvotes: 4
Reputation: 269081
The list_objects_v2() call returns a list of objects, and the Key of each object includes the full path of the object. Therefore, you can simply extract the paths from the Keys of all objects:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='my-bucket')
# folder1/folder2/foo.txt --> folder1/folder2
paths = {object['Key'][:object['Key'].rfind('/')] for object in response['Contents'] if '/' in object['Key']}
for path in sorted(paths):
    print(path)
If your bucket contains more than 1000 objects, then you will either need to loop through the results using ContinuationToken or use a paginator. See: the list_objects_v2 paginator in the boto3 documentation.
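A minimal sketch of the ContinuationToken loop, for reference (the bucket name is a placeholder):
import boto3

s3_client = boto3.client('s3')

paths = set()
kwargs = {'Bucket': 'my-bucket'}
while True:
    response = s3_client.list_objects_v2(**kwargs)
    # folder1/folder2/foo.txt --> folder1/folder2
    for obj in response.get('Contents', []):
        key = obj['Key']
        if '/' in key:
            paths.add(key[:key.rfind('/')])
    if response.get('IsTruncated'):
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    else:
        break

for path in sorted(paths):
    print(path)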
Upvotes: 1