Aleksei Petrenko

Reputation: 7168

Boto S3 API does not return full list of keys

I use the boto S3 API in a Python script that slowly copies data from S3 to my local filesystem. The script worked well for a couple of days, but now there is a problem.

I use the following API function to obtain the list of keys in "directory":

keys = bucket.get_all_keys(prefix=dirname)

This function (get_all_keys) does not always return the full list of keys: I can see more keys through the AWS web interface or via aws s3 ls s3://path.

Reproduced the issue on versions 2.15 and 2.30.

Could boto be caching some of my requests to S3 (I repeat the same requests over and over)? How can I resolve this issue? Any suggestions?

Upvotes: 6

Views: 10930

Answers (4)

hamed

Reputation: 1383

Use pagination in boto3. This function should give you the answer:

import boto3

def s3_list_files(bucket_name, prefix):
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects")

    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    keys = []
    for page in page_iterator:
        # "Contents" is absent when a page has no matching objects
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])

    return keys

Upvotes: 1

Joseph Lust

Reputation: 19975

You need to paginate through the results, by making multiple requests. list() will do this for you automatically. You can use the below example for greater control or to resume from failed requests.

This iterative approach is also more scalable if you're working with millions of objects.

marker = None
while True:
    keys = bucket.get_all_keys(marker=marker)
    last_key = None

    for k in keys:
        # TODO Do something with your keys!
        last_key = k.name

    if not keys.is_truncated:
        break

    marker = last_key

The ResultSet docs for get_all_keys() say this should be done automatically by the for iterator, but it isn't. :(
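The marker-based loop above is hard to test against a live bucket, but the pattern itself is easy to check in isolation. A minimal sketch, using a hypothetical fake page-fetcher in place of get_all_keys (the 1013-key count mirrors the asker's directory; all names here are illustrative, not part of any boto API):

```python
def paginate(fetch_page, page_size=1000):
    """Generic marker-based pagination, mirroring S3's truncated listings."""
    marker = None
    while True:
        page = fetch_page(marker, page_size)
        for key in page:
            yield key
        if len(page) < page_size:
            # A short page means the listing was not truncated.
            break
        marker = page[-1]

# Fake backend with 1013 "keys", like the asker's directory.
ALL_KEYS = ["key-%04d" % i for i in range(1013)]

def fake_fetch(marker, page_size):
    start = 0 if marker is None else ALL_KEYS.index(marker) + 1
    return ALL_KEYS[start:start + page_size]

result = list(paginate(fake_fetch))  # two requests: 1000 keys, then 13
```
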

Upvotes: 3

garnaat

Reputation: 45846

There is an easier way. The Bucket object itself can act as an iterator and it knows how to handle paginated responses. So, if there are more results available, it will automatically fetch them behind the scenes. So, something like this should allow you to iterate over all of the objects in your bucket:

for key in bucket:
    # do something with your key

If you want to specify a prefix and get a listing of all keys starting with that prefix, you can do it like this:

for key in bucket.list(prefix='foobar'):
    # do something with your key

Or, if you really, really want to build up a list of objects, just do this:

keys = [k for k in bucket]

Note, however, that buckets can hold an unlimited number of keys so be careful with this because it will build a list of all keys in memory.
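The memory point above can be illustrated with plain iterators. A sketch using a hypothetical generator as a stand-in for a bucket (no S3 connection involved):

```python
# A generator stands in for a bucket: keys are produced lazily,
# never all held in memory at once.
def fake_bucket(n):
    for i in range(n):
        yield "key-%06d" % i

# Streaming, like `for key in bucket`: constant memory.
count = sum(1 for _ in fake_bucket(5000))

# Building a list, like `[k for k in bucket]`: every key held at once.
keys = list(fake_bucket(5000))
```
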

Upvotes: 13

Aleksei Petrenko

Reputation: 7168

Just managed to get it working! It turned out that I had 1013 keys in my directory on S3, and get_all_keys can return at most 1000 keys per request due to AWS API restrictions.

The solution is simple: use the higher-level bucket.list() method (without the delimiter parameter), which paginates for you:

keys = list(bucket.list(prefix=dirname))

Upvotes: 5
