Reputation: 31
I am trying to use the Boto3 API (specifically list_objects_v2, since that is what's recommended) to get all the keys in my buckets. I ran 8 tasks in parallel, one per bucket, over roughly 55 GB of data in total. The calls have been running for 16+ hours and none of them have returned. Is this expected behavior? Even if it had to download the entire 55 GB, that should take at most a few hours (I'm on a very fast internet connection).
Is AWS rate limiting me in some unusual way? Their documentation says there is a limit of 5,500 requests per second. Since I'm looking at on the order of ~5,000,000 S3 objects, my estimate is that even with rate limiting (and an infinitely fast connection) the total listing time should be bounded at no more than about 15 minutes, so that doesn't seem to be the issue.
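To make that estimate concrete, here is a rough back-of-the-envelope check on the request count alone (assuming ~5,000,000 keys and the 1,000-keys-per-page limit of list_objects_v2):

# Rough sanity check of how many list calls ~5,000,000 keys would need
total_keys = 5_000_000
keys_per_page = 1_000                          # list_objects_v2 returns at most 1,000 keys per call
requests_needed = total_keys / keys_per_page   # 5,000 sequential requests
print(requests_needed)                         # even at ~10 requests/second that's under ten minutes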
As an aside, this operation appears to be monopolizing my CPU. The code below is what I am executing for this call. Is there something blatantly obvious I am missing here? It looks to me like it's spending all of its time in the call to list_objects_v2. I'm no AWS guru, so it's possible what I'm doing is bad for reasons I don't know about.
def list_all_keys(self):
    reached_end = False
    all_keys = []
    token = None
    while not reached_end:
        # will limit to 1000 objects
        response = self.client.list_objects_v2(Bucket=self.bucket)
        token = response.get('NextContinuationToken', None)
        contents = response.get('Contents', [])
        all_keys.extend([obj['Key'] for obj in contents])
        if not token and response.get('IsTruncated') is False:
            reached_end = True
    return all_keys
Upvotes: 0
Views: 5144
Reputation: 425
The client is almost always faster and better than the resource interface, except in ease of use. Depending on the OS or library versions, the client can turn a one-hour list_objects call into less than 5 minutes, or both versions take about 5 minutes (with the client being 0.5 to 1 minute faster). I don't know the circumstances, but on my Linux machine the resource took about an hour versus 5 minutes for the client, while on Windows both took about 5 minutes. That could have been due to an outdated boto3 library on the Linux machine.
Anyway, the error in your code is that you never use the continuation token, so every call to list_objects_v2 returns the same first 1000 keys over and over until the end of time, growing your list forever.
def list_all_keys(self):
    reached_end = False
    all_keys = []
    token = None
    while not reached_end:
        # each call will return at most 1000 objects
        if token is None:
            # first page: no continuation token yet
            response = self.client.list_objects_v2(Bucket=self.bucket)
        else:
            # subsequent pages: pass the token from the previous response
            response = self.client.list_objects_v2(Bucket=self.bucket, ContinuationToken=token)
        token = response.get('NextContinuationToken', None)
        contents = response.get('Contents', [])
        all_keys.extend([obj['Key'] for obj in contents])
        if not token and response.get('IsTruncated') is False:
            reached_end = True
    return all_keys
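As an aside, boto3 also ships a paginator for list_objects_v2 that handles the continuation token for you. A minimal sketch (the standalone client and function name here are my own, not part of the question's class):

import boto3

def list_all_keys_paginated(bucket_name):
    # The paginator issues follow-up requests with the continuation token automatically.
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys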
Upvotes: 0
Reputation: 31
For anyone looking at this, I have actually arrived at an answer. The key is not to call list_objects_v2 directly but to use the S3 resource's Bucket object instead. This is at least 10x faster on my machine, and I would guess it should generally be preferred.
import boto3

bucket = boto3.resource('s3').Bucket('bucket-name')
keys = []
for obj in bucket.objects.all():
    keys.append(obj.key)
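If you only need part of a bucket, or want to split the work across the parallel tasks mentioned in the question, the same collection can be filtered by prefix. A small sketch (the prefix below is a made-up example):

import boto3

bucket = boto3.resource('s3').Bucket('bucket-name')

# Hypothetical prefix: list only the keys under one "folder" so that
# several prefixes can be handled by separate workers.
keys = [obj.key for obj in bucket.objects.filter(Prefix='logs/2021/')]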
Upvotes: 3