Atihska

Reputation: 5126

How to iterate over subkeys, and eventually the files inside them, in AWS S3

I have an AWS S3 key path bucket-name/fo1/fo2/fo3 that has sub-paths such as bucket-name/fo1/fo2/fo3/fo_1, bucket-name/fo1/fo2/fo3/fo_2, bucket-name/fo1/fo2/fo3/fo_3, and so on. I want to iterate over these keys fo_1, fo_2, fo_3, etc. within the path bucket-name/fo1/fo2/fo3.

I tried the following, but it doesn't work:

import boto3

s3 = boto3.client('s3')
s3_bucket = 'bucket-name'

prefix = 'fo1/fo2/fo3'
for obj in s3.list_objects_v2(Bucket=s3_bucket, Prefix=prefix, Delimiter='/'):
    print(obj)  # prints just a string such as 'MaxKeys'

Any help will be appreciated!

UPDATE:

s3://bucket-name/
        fo1/
           fo2/
              fo3/
                 fo_1/
                     file1
                     ...
                 fo_2/
                     file2
                     ...
                 fo_3/
                     file1
                     ...
                 fo_4/
                     file1
                     ...
                 ...

This is my structure, and I am looking to get fo_1, fo_2, fo_3 and the files inside them. I want everything under fo3 and nothing outside of it.

Upvotes: 1

Views: 2392

Answers (3)

Christopher Lynch

Reputation: 201

Possibly the following piece of code can be of use to you. I expanded a bit on John's answer, as I was looking for something similar. I basically recreated the os.walk() behavior, which you might be more familiar with.

    import boto3

    # function to replicate os.walk() behavior for a flat list of S3 keys
    def s3walk(locations, prefix):

        # recursively add each location to root, starting from prefix
        def process_location(root, prefix_local, location):
            # add a new root entry of (sub-directories, files) if not yet present
            if prefix_local not in root:
                root[prefix_local] = (set(), set())
            # check how many path segments remain after the prefix
            remainder = location[len(prefix_local) + 1:]
            structure = remainder.split('/')
            # if we are not yet in the folder of the file, continue with a longer prefix
            if len(structure) > 1:
                # add the sub-directory
                root[prefix_local][0].add(structure[0])
                # make sure the file is added along the way
                process_location(root, prefix_local + '/' + structure[0], location)
            else:
                # add the file
                root[prefix_local][1].add(structure[0])

        root = {}
        for location in locations:
            process_location(root, prefix, location)

        return root.items()

    if __name__ == "__main__":
        s3_client = boto3.client('s3', region_name='eu-west-3')
        s3_bucket = 'bucket-name'
        prefix = 'fo1/fo2/fo3'

        # get the list of objects under the prefix
        # (a single list_objects_v2 call returns at most 1,000 keys;
        # use a paginator for larger listings)
        response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=prefix)
        # retrieve the key of each object
        locations = [obj['Key'] for obj in response['Contents']]

        for root, (subdirs, files) in s3walk(locations, prefix):
            print(root, subdirs, files)
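
With the structure from the question, this should print one (root, subdirs, files) tuple per "folder": first fo1/fo2/fo3 with sub-directories such as {'fo_1', 'fo_2', ...} and no files, then each fo1/fo2/fo3/fo_N with the files it contains, mirroring what os.walk() yields for a local directory tree.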

Upvotes: 0

John Rotenstein

Reputation: 269320

You should examine the value returned by the list_objects_v2() call to understand the data that is being returned.

  • If a Prefix has been specified, only objects whose Keys start with that prefix are returned. If a Delimiter (for example, '/') is also supplied, the next level of "sub-directories" is returned as CommonPrefixes rather than as objects.
  • If no Prefix is provided, all objects in the bucket are returned. You can then filter them yourself in code, as shown below.
import boto3

s3_client = boto3.client('s3', region_name='ap-southeast-2')
s3_bucket = 'my-bucket'
prefix = 'fo1/fo2/fo3/'

# list the objects in the bucket, then filter on the Key prefix in code
response = s3_client.list_objects_v2(Bucket=s3_bucket)
for obj in response['Contents']:
    if obj['Key'].startswith(prefix):
        print(obj['Key'])
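
If you instead want S3 itself to group the "sub-directories" for you, here is a minimal sketch using Delimiter (the bucket and folder names are placeholders taken from the question); the groups come back in CommonPrefixes rather than Contents:

import boto3

s3_client = boto3.client('s3')
s3_bucket = 'bucket-name'  # placeholder name from the question

# with the Prefix ending in '/' and Delimiter='/', S3 groups keys by their
# next path segment and returns each group as a CommonPrefix instead of
# listing every object individually
response = s3_client.list_objects_v2(Bucket=s3_bucket,
                                     Prefix='fo1/fo2/fo3/',
                                     Delimiter='/')

for common_prefix in response.get('CommonPrefixes', []):
    print(common_prefix['Prefix'])  # e.g. fo1/fo2/fo3/fo_1/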

Upvotes: 0

John Rotenstein

Reputation: 269320

The first thing to understand about Amazon S3 is that folders do not exist. Rather, objects are stored with their full path as their Key (filename).

For example, I could copy a file to a bucket using the AWS Command-Line Interface (CLI):

aws s3 cp foo.txt s3://my-bucket/fo1/fo2/fo3/foo.txt

This would work even though the folders do not exist.

To make things convenient for humans, there is a "pretend" set of folders that are provided via the concept of a common prefix. Thus, in the management console, the folders would appear to be there. However, if the object was then deleted with:

aws s3 rm s3://my-bucket/fo1/fo2/fo3/foo.txt

The result is that the folders would immediately disappear because they never actually existed!

Also for convenience, some Amazon S3 commands allow you to specify a Prefix and Delimiter. This can be used, for example, to list only the objects in the fo3 folder. What it is really doing is merely listing the objects that have a Key that starts with fo1/fo2/fo3/. When the Key for the object is returned, it will always have the full path to the object, because the Key actually is the full path. (There is no concept of a filename separate from the complete Key.)

So, if you want a listing of all files under fo1 (including those in fo2 and fo3), you can do a listing with a Prefix of fo1/ and receive back all objects whose Keys start with fo1/; this will include objects in sub-folders, since they all share that prefix.
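
For example, here is a minimal sketch that lists everything under fo1/ and then narrows it down to fo3 in code (the bucket and folder names are placeholders from the question); it uses a paginator because a single list_objects_v2 response is capped at 1,000 keys:

import boto3

s3_client = boto3.client('s3')
s3_bucket = 'bucket-name'  # placeholder name from the question

# a paginator follows continuation tokens for you, so the listing is not
# limited to the 1,000 keys of a single list_objects_v2 response
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=s3_bucket, Prefix='fo1/'):
    for obj in page.get('Contents', []):
        # the Key is always the full path, e.g. fo1/fo2/fo3/fo_1/file1
        if obj['Key'].startswith('fo1/fo2/fo3/'):
            print(obj['Key'])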

Bottom line: Rather than thinking of old-fashioned directories, think of Amazon S3 as a flat storage structure. If necessary, you can do filtering of results in your own code.

Upvotes: 3
