Reputation: 2599
I'm working on a Lambda function for which I need a list of all the folders in an S3 bucket. I need to be able to traverse each folder and get all of its subfolders until the end of the tree is reached.
I implemented this by calling the list_objects_v2 function in boto3 recursively with different prefixes, and while it does work, it is very slow; for buckets with a lot of folders the Lambda exceeds the 15-minute timeout.
I wanted to know if there is a more efficient way of doing this.
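For context, here is a minimal sketch of the kind of recursive call I'm describing (the bucket name is a placeholder, and pagination within each level is omitted):
import boto3

s3 = boto3.client('s3')

def list_folders(bucket, prefix=''):
    # List the immediate "subfolders" under this prefix, then recurse into each one
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for common_prefix in response.get('CommonPrefixes', []):
        folder = common_prefix['Prefix']
        print(folder)
        list_folders(bucket, folder)

list_folders('my-bucket')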
Update: Sample output. This is what I'm getting right now by calling list_objects_v2 recursively:
L1/
L1/hist/
L1/hist/2022-01-03
L1/hist/2022-01-01
...
Update 2: Even after using a paginator as mentioned in the answers below, some buckets have so many objects that the Lambda is still exceeding the 15-minute timeout. I'm not sure how to tackle this; please help!
Upvotes: 0
Views: 3195
Reputation: 10828
You can enumerate all of the objects in the bucket, find each object's "folder" (really the prefix up to the last delimiter), and build up a list of the available folders:
import boto3

seen = set()
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket='bucket-name'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        folder = key[:key.rindex("/")] if '/' in key else ""
        if folder not in seen:
            seen.add(folder)
            print(folder)
Alternatively, you could use the same basic recursive logic that you were using, only with multiple workers, which lets you overlap some of the time spent waiting for replies:
import boto3

def recursion_worker(bucket_name, prefix):
    # Look in the bucket at the given prefix, and return a list of folders
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    folders = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/'):
        for sub_prefix in page.get('CommonPrefixes', []):
            folders.append(sub_prefix['Prefix'])
        # S3 returns common prefixes first; if there are any objects, it
        # means further pages will just return objects, so no need
        # to keep going.
        # Note that this isn't a guaranteed order, though it's unlikely
        # to change. To be 100% safe, don't do this check, at the expense
        # of a slower result set for large buckets.
        if len(page.get('Contents', [])) > 0:
            break
    return folders
def folders_via_recursion():
    import multiprocessing
    from collections import deque

    bucket_name = 'example-bucket'
    folders = []
    # Spin up many workers, more than we have CPU cores for,
    # since most of the workers will be spending most
    # of their time waiting for network traffic
    with multiprocessing.Pool(processes=32) as pool:
        pending = deque()
        # Seed the workers with the first request to list objects at
        # the root of the bucket
        pending.append(pool.apply_async(recursion_worker, (bucket_name, "")))
        while len(pending) > 0:
            # Keep going while there are items to parse
            temp = pending.popleft()
            for cur in temp.get():
                # Print out every folder, and store the result in an array
                # to consume later on
                print(cur)
                folders.append(cur)
                # And tell a free worker to list out the folder we just found
                pending.append(pool.apply_async(recursion_worker, (bucket_name, cur)))
    # All done, we can consume the folders array as needed, though
    # do note it's in a somewhat random order; run something like
    #     folders.sort()
    # first if you want a stable order before looking at it
If even this fails, then your bucket is simply too big to parse in one go in a Lambda. You'll need to consider some other way to get a list of the objects, such as using Amazon S3 Inventory to create a listing outside of this Lambda that it can then process.
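As a rough illustration of that route, here is a minimal sketch that derives the folder list from an inventory report, assuming a CSV-format inventory where the second column is the (URL-encoded) object key and the report file has already been downloaded and decompressed locally (the file name is a placeholder):
import csv
from urllib.parse import unquote

folders = set()
# 'inventory.csv' is a placeholder for a downloaded, decompressed inventory report.
# In the default CSV format, column 0 is the bucket name and column 1 is the
# URL-encoded object key.
with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        key = unquote(row[1])
        if '/' in key:
            folders.add(key[:key.rindex('/')])

for folder in sorted(folders):
    print(folder)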
Upvotes: 4
Reputation: 269081
The list_objects_v2() call returns a list of objects, and the Key of each object includes the full path of the object. Therefore, you can simply extract the paths from the Keys of all objects:
import boto3
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='my-bucket')
# folder1/folder2/foo.txt --> folder1/folder2
paths = {object['Key'][:object['Key'].rfind('/')] for object in response['Contents'] if '/' in object['Key']}
for path in sorted(paths):
    print(path)
If your bucket contains more than 1000 objects, then you will either need to loop through the results using ContinuationToken or use a paginator. See: the list_objects_v2 paginator in the boto3 documentation.
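A minimal sketch of the ContinuationToken loop, for reference (the bucket name is a placeholder):
import boto3

s3_client = boto3.client('s3')

paths = set()
kwargs = {'Bucket': 'my-bucket'}
while True:
    response = s3_client.list_objects_v2(**kwargs)
    # folder1/folder2/foo.txt --> folder1/folder2
    for obj in response.get('Contents', []):
        key = obj['Key']
        if '/' in key:
            paths.add(key[:key.rfind('/')])
    if response.get('IsTruncated'):
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    else:
        break

for path in sorted(paths):
    print(path)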
Upvotes: 1