Reputation: 57
I have an s3 bucket with a hierarchy of folders like this:
Folder 1
    Subfolder 1
        Subsubfolder 1
            Subsubsubfolder 1
            Subsubsubfolder 2
        Subsubfolder 2
            Subsubsubfolder 1
            Subsubsubfolder 2
    Subfolder 2
        Subsubfolder 1
            Subsubsubfolder 1
            Subsubsubfolder 2
        Subsubfolder 2
            Subsubsubfolder 1
            Subsubsubfolder 2
I am trying to retrieve every folder and an overview of the structure within the bucket. I am currently using this code:
import boto3

s3 = boto3.client('s3')
bucket = "Bucket_name"
response = s3.list_objects_v2(Bucket=bucket)
for obj in response['Contents']:
    print(obj['Key'])
This gets me the file path of every file in the deepest subfolders, which is not what I am looking for. Is there any way I can list only the folders and all the subfolders within the bucket?
Upvotes: 1
Views: 5537
Reputation: 10828
If you want to mimic the behavior of the AWS CLI tool and other UI representations of S3, you need to pass a delimiter to any list objects call to tell S3 to group any objects with a shared prefix, and present them as something like a folder.
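As a minimal sketch of that grouping (assuming a placeholder bucket named "example-bucket" and configured credentials), a single delimited call separates folders from objects like this:

import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="example-bucket", Delimiter="/")

# Keys sharing a prefix up to the delimiter are grouped into
# CommonPrefixes, one entry per "folder"
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])  # e.g. "Folder 1/"

# Keys with no delimiter after the prefix are listed as plain objects
for s3_object in response.get("Contents", []):
    print(s3_object["Key"])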
A list-objects call returns at most 1,000 items per batch. To properly enumerate a bucket, you need to take the NextContinuationToken from the response and pass it into a follow-up call, repeating until no continuation token is returned. boto3 has a helper, get_paginator, that handles this logic for you.
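For illustration only, here is a minimal sketch of the loop that get_paginator wraps for you (the bucket name "example-bucket" is a placeholder, not something from the question):

import boto3

s3 = boto3.client("s3")
kwargs = {"Bucket": "example-bucket"}
while True:
    response = s3.list_objects_v2(**kwargs)
    # Process the current batch of up to 1,000 objects
    for s3_object in response.get("Contents", []):
        print(s3_object["Key"])
    # IsTruncated is False on the last page, at which point there is
    # no NextContinuationToken and we can stop
    if not response.get("IsTruncated"):
        break
    kwargs["ContinuationToken"] = response["NextContinuationToken"]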
Putting it all together, you can list the objects in an S3 bucket with something like this. This includes showing how to present the output, with a format that looks vaguely like how aws s3 ls works.
import boto3
from datetime import datetime

def enum_s3_items(s3, bucket_name, prefix="", delimiter="/"):
    # Create a paginator to handle multiple pages from list_objects_v2
    paginator = s3.get_paginator("list_objects_v2")
    # Get each page from a call to list_objects_v2
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter):
        # Inside of each page, return the common prefixes (folders) first
        for common_prefix in page.get("CommonPrefixes", []):
            yield common_prefix
        # And, if it's present, return each item in turn
        for s3_object in page.get("Contents", []):
            yield s3_object

s3 = boto3.client("s3")

for obj in enum_s3_items(s3, "example-bucket"):
    # This is an example of how to process the output; in reality
    # you would no doubt want to do something application-specific
    # with the results.
    if 'Prefix' in obj:
        # For common prefixes, just output the name of the prefix
        # with some padding to mimic "aws s3 ls"
        print(" " * 27 + "PRE " + obj['Prefix'])
    else:
        # Grab the interesting info out of the object to mimic
        # how the CLI works.
        at = obj['LastModified']
        # Convert to local time, just to mimic what the CLI does
        at = at.astimezone(datetime.now().tzinfo)
        # And pretty-print the datetime
        at = at.strftime("%Y-%m-%d %H:%M:%S")
        # Pull out other information
        size = obj['Size']
        key = obj['Key']
        # Output to the console
        print(f"{at} {size:10d} {key}")
Upvotes: 2
Reputation: 4529
You use a prefix, and you have to take care of pagination to really get all entries. Something like the code below should do the trick: it creates a generator that yields every file/folder key starting from the prefix.
from typing import Iterable

import boto3

def get_all(bucket_name: str, prefix: str) -> Iterable[str]:
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]

all_ = list(get_all("a_bucket", "base_folder"))
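
Because no delimiter is passed, this yields full object keys rather than folders. As a hedged follow-up sketch, you could derive the set of folder prefixes from those keys ("a_bucket" and "base_folder" are the placeholders used above):

# Collect every folder prefix implied by the returned keys
folders = set()
for key in get_all("a_bucket", "base_folder"):
    parts = key.split("/")[:-1]  # drop the object's file name
    for depth in range(1, len(parts) + 1):
        folders.add("/".join(parts[:depth]) + "/")

for folder in sorted(folders):
    print(folder)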
Upvotes: 1