ThatComputerGuy

Reputation: 373

Get count of objects in a specific S3 folder using Boto3

I'm trying to get a count of the objects in an S3 folder.

Current code:

import boto3

bucket = 'some-bucket'
File = 'someLocation/File/'

objs = boto3.client('s3').list_objects_v2(Bucket=bucket, Prefix=File)
fileCount = objs['KeyCount']

This gives me a count that is 1 + the actual number of objects in S3.

Maybe it is counting "File" itself as a key too?

Upvotes: 20

Views: 44054

Answers (8)

DrStrangepork

Reputation: 3134

Each page returned by the 'list_objects_v2' paginator includes a 'KeyCount' entry, which is the number of keys returned by that request. So sum() the 'KeyCount' values across every page (default PageSize is 1000):

import boto3

# Assumes bucket and File are defined as in the question
s3 = boto3.client('s3')

# One-liner
fileCount = sum(page['KeyCount'] for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=File))

# More readable
s3p = s3.get_paginator('list_objects_v2')
s3i = s3p.paginate(Bucket=bucket, Prefix=File)
fileCount = sum(keyCount for keyCount in s3i.search('KeyCount'))

# Or
fileCount = 0
for keyCount in s3i.search('KeyCount'):
    fileCount += keyCount

Upvotes: 0

Khaleeque Ansari

Reputation: 69

If there are more than 1000 objects, use this code:

import boto3

def count_objects_in_s3_folder(bucket_name, folder_name):
    # Create an S3 client
    s3 = boto3.client('s3')

    # Specify the prefix (folder) within the bucket
    prefix = folder_name + '/'

    # Initialize the object count
    object_count = 0

    # Use the list_objects_v2 API to retrieve the objects in the folder
    paginator = s3.get_paginator('list_objects_v2')
    response_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    # Iterate through the paginated responses
    for response in response_iterator:
        if 'Contents' in response:
            object_count += len(response['Contents'])

    print(f"Number of objects in folder '{folder_name}': {object_count}")

# Provide the S3 bucket name and folder name to count objects in
bucket_name = 'your_bucket_name'
folder_name = 'your_folder_name'

count_objects_in_s3_folder(bucket_name, folder_name)

Upvotes: 2

Kamlesh Gallani

Reputation: 771

I know this is an older post, but I thought of posting anyway. I did a comparison between several methods, and it is evident that a paginator with list_objects_v2 is the fastest way to get a list of objects in an S3 bucket when the number of files is greater than 1000.

import boto3

## through boto3 resource
def get_files_on_s3_resource(bucket_name, folder_path):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    folder_objects = list(bucket.objects.filter(Prefix=folder_path))
    files_on_s3 = []
    for file in folder_objects:
        files_on_s3.append(file.key)
    return files_on_s3

## with paginator for list_objects_v2
def list_s3_objects_wp(bucket_name, folder_path):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')

    object_list = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=folder_path):
        for content in page.get('Contents', []):
            object_list.append(content)

    return object_list

## without paginator for list_objects_v2
def list_s3_objects_wop(bucket_name, folder_path):
    s3 = boto3.client('s3')
    # get list of files on s3
    object_list = []
    for obj in s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_path)['Contents']:
        object_list.append(obj)
        
    return object_list

## tried a way suggested in one of the answers above 
def list_s3_objects_so(bucket_name, folder_path):
    s3 = boto3.resource('s3')
    # get list of files on s3
    bucket = s3.Bucket(bucket_name)
    count_obj = sum(1 for _ in bucket.objects.filter(Prefix=folder_path))
    return count_obj

Here is the comparative output:

import time

bucket_name = 'someBucket'
folder_path = 'someFolder/someKey/'

start_time = time.time()
files_on_s3 = get_files_on_s3_resource(bucket_name, folder_path)
end_time = time.time()
print('Time taken to get files on s3: ' + str(end_time - start_time))
print(len(files_on_s3))

start_time = time.time()
files_on_s3 = list_s3_objects_wp(bucket_name, folder_path)
end_time = time.time()
print('Time taken to get files on s3: ' + str(end_time - start_time))
print(len(files_on_s3))

start_time = time.time()
files_on_s3 = list_s3_objects_wop(bucket_name, folder_path)
end_time = time.time()
print('Time taken to get files on s3: ' + str(end_time - start_time))
print(len(files_on_s3))

start_time = time.time()
files_on_s3 = list_s3_objects_so(bucket_name, folder_path)
end_time = time.time()
print('Time taken to get files on s3: ' + str(end_time - start_time))
print(files_on_s3)


> Time taken to get files on s3: 7.044371128082275
> 21976
> Time taken to get files on s3: 4.960357189178467
> 21976
> Time taken to get files on s3: 0.6216549873352051
> 1000
> Time taken to get files on s3: 7.754430055618286
> 21976

Upvotes: 3

Ehsan

Reputation: 163

The following code worked perfectly for me:

import boto3

def getNumberOfObjectsInBucket(bucketName, prefix):
    count = 0
    response = boto3.client('s3').list_objects_v2(Bucket=bucketName, Prefix=prefix)
    for obj in response.get('Contents', []):
        if obj['Size'] != 0:
            # print(obj['Key'])
            count += 1
    return count

obj['Size'] == 0 picks out the zero-length "folder" placeholder keys, if you want to check those; obj['Size'] != 0 gives you all the non-folder keys. Note that a single list_objects_v2 call returns at most 1000 keys. Sample function call below:

getNumberOfObjectsInBucket('foo-test-bucket','foo-output/')
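If the prefix holds more than 1000 keys, a single list_objects_v2 call only sees the first page, so a paginated variant of the same size check might look like this (a sketch with a hypothetical helper name, not part of the original answer):

import boto3

def getNumberOfObjectsInBucketPaginated(bucketName, prefix):
    # Hypothetical helper: same Size check as above, but paginated so
    # prefixes with more than 1000 keys are counted correctly.
    count = 0
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucketName, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Size'] != 0:
                count += 1
    return count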

Upvotes: 3

Anuj Sharma

Reputation: 537

If you have credentials to access that bucket, then you can use this simple code. The code below gives you a list of the keys; a list comprehension is used for readability.

filter() is used to narrow the listing to the folder's prefix, because in a bucket the folder name is just part of each object's key, as John Rotenstein explains concisely.

import boto3

bucket = "Sample_Bucket"
folder = "Sample_Folder"
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)
files_in_s3 = [f.key.split(folder + "/")[1] for f in s3_bucket.objects.filter(Prefix=folder)]
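
Since the question asks for a count, you could then take the length of that list (a usage note added for illustration, not part of the original answer):

fileCount = len(files_in_s3)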

Upvotes: 0

matt burns

Reputation: 25370

If there are more than 1000 entries, you need to use paginators, like this:

import boto3

count = 0
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/', Delimiter='/'):
    # CommonPrefixes holds the sub-folders found under the prefix on each page
    count += len(result.get('CommonPrefixes', []))

Upvotes: 4

vagabond

Reputation: 3594

Assuming you want to count the keys in a bucket and don't want to hit the 1000-key limit of list_objects_v2, the code below worked for me, but I'm wondering if there is a better, faster way to do it! I tried looking for a packaged function in the boto3 S3 connector, but there isn't one!

import boto3

# connect to s3 - assuming your creds are all set up and you have boto3 installed
s3 = boto3.resource('s3')

# identify the bucket - you can use prefix if you know what your bucket name starts with
for bucket in s3.buckets.all():
    print(bucket.name)

# get the bucket
bucket = s3.Bucket('my-s3-bucket')

# use loop and count increment
count_obj = 0
for i in bucket.objects.all():
    count_obj = count_obj + 1
print(count_obj)

Upvotes: 15

John Rotenstein

Reputation: 269091

"Folders" do not actually exist in Amazon S3. Instead, all objects have their full path as their filename ('Key'). I think you already know this.

However, it is possible to 'create' a folder by creating a zero-length object that has the same name as the folder. This causes the folder to appear in listings and is what happens if folders are created via the management console.

Thus, you could exclude zero-length objects from your count.

For an example, see: Determine if folder or file key - Boto
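
A minimal sketch of that idea, reusing the question's bucket and prefix with a list_objects_v2 paginator (an illustration added here, not code from this answer):

import boto3

# Count only non-zero-length objects under the prefix, so the zero-length
# "folder" placeholder object is excluded from the total.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

fileCount = 0
for page in paginator.paginate(Bucket='some-bucket', Prefix='someLocation/File/'):
    for obj in page.get('Contents', []):
        if obj['Size'] > 0:
            fileCount += 1

print(fileCount)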

Upvotes: 5
