Reputation: 7168
I use the boto S3 API in a Python script that slowly copies data from S3 to my local filesystem. The script worked well for a couple of days, but now there is a problem.
I use the following API call to obtain the list of keys in a "directory":
keys = bucket.get_all_keys(prefix=dirname)
And this function (get_all_keys) does not always return the full list of keys; I can see more keys through the AWS web interface or via aws s3 ls s3://path.
I reproduced the issue on boto versions 2.15 and 2.30.
Could boto be caching some of my requests to S3 (I repeat the same requests over and over)? How can I resolve this issue? Any suggestions?
Upvotes: 6
Views: 10930
Reputation: 1383
Use pagination in boto3. This function should give you the answer:
import boto3

client = boto3.client("s3")

def s3_list_files(bucket_name, prefix):
    # The paginator issues as many list_objects requests as needed,
    # so results beyond the 1000-key page limit are included.
    paginator = client.get_paginator("list_objects")
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    keys = []
    for page in page_iterator:
        # Pages that match no objects have no "Contents" entry.
        if "Contents" in page:
            for obj in page["Contents"]:
                keys.append(obj["Key"])
    return keys
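A quick usage sketch (the bucket name and prefix here are placeholders, not from the question):
keys = s3_list_files("my-bucket", "path/to/dir/")
print(len(keys))  # should match what aws s3 ls reports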
Upvotes: 1
Reputation: 19975
You need to paginate through the results by making multiple requests. list() will do this for you automatically, but you can use the example below for greater control or to resume from failed requests.
This iterative approach is also more scalable if you're working with millions of objects.
marker = None
while True:
    # Each response contains at most 1000 keys; pass the last key
    # seen back as the marker to request the next page.
    keys = bucket.get_all_keys(marker=marker)
    last_key = None
    for k in keys:
        # TODO Do something with your keys!
        last_key = k.name
    if not keys.is_truncated:
        break
    marker = last_key
The ResultSet docs linked from the get_all_keys() docs say this should be done automatically by the for iterator, but it isn't. :(
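To resume from failed requests, one option is to persist the marker between runs. A minimal sketch of that idea, assuming a local state file (MARKER_FILE is a hypothetical name, not part of boto):
import os

MARKER_FILE = "listing.marker"  # hypothetical state file for this sketch

# Restore the last saved marker, if any, so a previous run can resume.
marker = None
if os.path.exists(MARKER_FILE):
    with open(MARKER_FILE) as f:
        marker = f.read().strip() or None

while True:
    keys = bucket.get_all_keys(marker=marker)
    last_key = None
    for k in keys:
        # do something with your key
        last_key = k.name
    if not keys.is_truncated:
        break
    marker = last_key
    # Persist progress so an interrupted run can pick up from here.
    with open(MARKER_FILE, "w") as f:
        f.write(marker)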
Upvotes: 3
Reputation: 45846
There is an easier way. The Bucket object itself can act as an iterator, and it knows how to handle paginated responses: if more results are available, it automatically fetches them behind the scenes. So, something like this should allow you to iterate over all of the objects in your bucket:
for key in bucket:
    # do something with your key
If you want to specify a prefix and get a listing of all keys starting with that prefix, you can do it like this:
for key in bucket.list(prefix='foobar'):
    # do something with your key
Or, if you really, really want to build up a list of objects, just do this:
keys = [k for k in bucket]
Note, however, that buckets can hold an unlimited number of keys, so be careful with this: it will build a list of all keys in memory.
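If you only need an aggregate rather than the keys themselves, you can consume the iterator without materializing a list. A minimal sketch, reusing the 'foobar' prefix from above:
# Count keys lazily; pages are fetched as the iterator is consumed.
count = sum(1 for _ in bucket.list(prefix='foobar'))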
Upvotes: 13
Reputation: 7168
Just managed to get it working!
It turned out that I had 1013 keys in my directory on S3, and get_all_keys can return at most 1000 keys per request due to AWS API restrictions.
The solution is simple: just use the more high-level list() function without the delimiter parameter:
keys = list(bucket.list(prefix=dirname))
Upvotes: 5