Reputation: 57
I have an s3 bucket with a hierarchy of folders like this:
Folder 1
    Subfolder 1
        Subsubfolder 1
            Subsubsubfolder 1
            Subsubsubfolder 2
        Subsubfolder 2
            Subsubsubfolder 1
            Subsubsubfolder 2
    Subfolder 2
        Subsubfolder 1
            Subsubsubfolder 1
            Subsubsubfolder 2
        Subsubfolder 2
            Subsubsubfolder 1
            Subsubsubfolder 2
I am trying to retrieve every folder and an overview of the structure within the bucket. I am currently using this code:
import boto3

s3 = boto3.client('s3')
bucket = "Bucket_name"
response = s3.list_objects_v2(Bucket=bucket)
for obj in response['Contents']:
    print(obj['Key'])
This gets me the file path of every file in the deepest subfolders, which is not what I am looking for. Is there any way I can list only the folders and all the subfolders within the bucket?
Upvotes: 1
Views: 5537
Reputation: 10828
If you want to mimic the behavior of the AWS CLI tool and other UI representations of S3, you need to pass a delimiter to any list objects call to tell S3 to group any objects with a shared prefix, and present them as something like a folder.
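As a minimal sketch of that grouping (assuming a placeholder bucket named "example-bucket" and configured credentials), a single delimited call separates folders from objects like this:

import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="example-bucket", Delimiter="/")

# Keys sharing a prefix up to the delimiter are grouped into
# CommonPrefixes, one entry per "folder"
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])  # e.g. "Folder 1/"

# Keys with no delimiter after the prefix are listed as plain objects
for s3_object in response.get("Contents", []):
    print(s3_object["Key"])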
A list-objects call returns at most 1,000 items per batch. To properly enumerate a bucket, you need to take the NextContinuationToken from the response and pass it into a follow-up call, repeating until no continuation token is returned. boto3 has a helper, get_paginator, that handles this logic for you.
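For illustration only, here is a minimal sketch of the loop that get_paginator wraps for you (the bucket name "example-bucket" is a placeholder, not something from the question):

import boto3

s3 = boto3.client("s3")
kwargs = {"Bucket": "example-bucket"}
while True:
    response = s3.list_objects_v2(**kwargs)
    # Process the current batch of up to 1,000 objects
    for s3_object in response.get("Contents", []):
        print(s3_object["Key"])
    # IsTruncated is False on the last page, at which point there is
    # no NextContinuationToken and we can stop
    if not response.get("IsTruncated"):
        break
    kwargs["ContinuationToken"] = response["NextContinuationToken"]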
Putting it all together, you can list the objects in an S3 bucket with something like this. This includes showing how to present the output, with a format that looks vaguely like how aws s3 ls works.
import boto3
from datetime import datetime

def enum_s3_items(s3, bucket_name, prefix="", delimiter="/"):
    # Create a paginator to handle multiple pages from list_objects_v2
    paginator = s3.get_paginator("list_objects_v2")
    # Get each page from a call to list_objects_v2
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter):
        # Inside of each page, return the common prefixes (folders) first
        for common_prefix in page.get("CommonPrefixes", []):
            yield common_prefix
        # And, if it's present, return each item in turn
        for s3_object in page.get("Contents", []):
            yield s3_object

s3 = boto3.client("s3")

for obj in enum_s3_items(s3, "example-bucket"):
    # This is an example of how to process the output; in reality
    # you would no doubt want to do something application-specific
    # with the results.
    if 'Prefix' in obj:
        # For common prefixes, just output the name of the prefix
        # with some padding to mimic "aws s3 ls"
        print(" " * 27 + "PRE " + obj['Prefix'])
    else:
        # Grab the interesting info out of the object to mimic
        # how the CLI works.
        at = obj['LastModified']
        # Convert to local time, just to mimic what the CLI does
        at = at.astimezone(datetime.now().tzinfo)
        # And pretty-print the datetime
        at = at.strftime("%Y-%m-%d %H:%M:%S")
        # Pull out other information
        size = obj['Size']
        key = obj['Key']
        # Output to the console
        print(f"{at} {size:10d} {key}")
Upvotes: 2
Reputation: 4529
You use a prefix, and you have to take care of pagination to really get all entries. Something like the code below should do the trick: it creates a generator that yields every file/folder key starting from the prefix.
from typing import Iterable

import boto3

def get_all(bucket_name: str, prefix: str) -> Iterable[str]:
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]

all_ = list(get_all("a_bucket", "base_folder"))
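
Because no delimiter is passed, this yields full object keys rather than folders. As a hedged follow-up sketch, you could derive the set of folder prefixes from those keys ("a_bucket" and "base_folder" are the placeholders used above):

# Collect every folder prefix implied by the returned keys
folders = set()
for key in get_all("a_bucket", "base_folder"):
    parts = key.split("/")[:-1]  # drop the object's file name
    for depth in range(1, len(parts) + 1):
        folders.add("/".join(parts[:depth]) + "/")

for folder in sorted(folders):
    print(folder)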
Upvotes: 1