Reputation: 31
I am trying to use the Boto3 API (specifically list_objects_v2, since that is what's recommended) to get all the keys in my buckets. I ran 8 tasks in parallel, one per bucket, over roughly 55 GB of data in total. The calls have been running for 16+ hours and none of them have returned. Is this expected behavior? Even if it had to download the entire 55 GB, that should take at most a few hours (I'm on a very fast internet connection).
Is AWS rate limiting me in some unusual way? Their documentation says there is a limit of 5,500 requests per second. Since I'm looking at on the order of ~5,000,000 S3 objects, my estimate is that even with rate limiting (and an infinitely fast connection) the total listing time should be bounded at no more than about 15 minutes, so that doesn't seem to be the issue.
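To make that estimate concrete, here is a rough back-of-the-envelope check on the request count alone (assuming ~5,000,000 keys and the 1,000-keys-per-page limit of list_objects_v2):

# Rough sanity check of how many list calls ~5,000,000 keys would need
total_keys = 5_000_000
keys_per_page = 1_000                          # list_objects_v2 returns at most 1,000 keys per call
requests_needed = total_keys / keys_per_page   # 5,000 sequential requests
print(requests_needed)                         # even at ~10 requests/second that's under ten minutes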
As an aside, this operation appears to be monopolizing my CPU. The code below is what I am executing for this call. Is there something blatantly obvious I am missing here? It looks to me like it's spending all of its time in the call to list_objects_v2. I'm no AWS guru, so it's possible what I'm doing is bad for reasons I don't know about.
def list_all_keys(self):
    reached_end = False
    all_keys = []
    token = None
    while not reached_end:
        # will limit to 1000 objects
        response = self.client.list_objects_v2(Bucket=self.bucket)
        token = response.get('NextContinuationToken', None)
        contents = response.get('Contents', [])
        all_keys.extend([obj['Key'] for obj in contents])
        if not token and response.get('IsTruncated') is False:
            reached_end = True
    return all_keys
Upvotes: 0
Views: 5144
Reputation: 425
The client is almost always faster and better than the resource interface, except in ease of use. Depending on the OS or library versions, the client can turn a one-hour list_objects call into less than 5 minutes, or both versions take about 5 minutes (with the client being 0.5 to 1 minute faster). I don't know the circumstances, but on my Linux machine the resource took about an hour versus 5 minutes for the client, while on Windows both took about 5 minutes. That could have been due to an outdated boto3 library on the Linux machine.
Anyway, the error in your code is that you never use the continuation token, so every call to list_objects_v2 returns the same first 1000 keys over and over until the end of time, growing your list forever.
def list_all_keys(self):
    reached_end = False
    all_keys = []
    token = None
    while not reached_end:
        # each call will return at most 1000 objects
        if token is None:
            # first page: no continuation token yet
            response = self.client.list_objects_v2(Bucket=self.bucket)
        else:
            # subsequent pages: pass the token from the previous response
            response = self.client.list_objects_v2(Bucket=self.bucket, ContinuationToken=token)
        token = response.get('NextContinuationToken', None)
        contents = response.get('Contents', [])
        all_keys.extend([obj['Key'] for obj in contents])
        if not token and response.get('IsTruncated') is False:
            reached_end = True
    return all_keys
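As an aside, boto3 also ships a paginator for list_objects_v2 that handles the continuation token for you. A minimal sketch (the standalone client and function name here are my own, not part of the question's class):

import boto3

def list_all_keys_paginated(bucket_name):
    # The paginator issues follow-up requests with the continuation token automatically.
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys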
Upvotes: 0
Reputation: 31
For anyone looking at this, I have actually arrived at an answer. The key is not to call list_objects_v2 directly but to use the S3 resource's Bucket object instead. This is at least 10x faster on my machine, and I would guess it should generally be preferred.
import boto3

bucket = boto3.resource('s3').Bucket('bucket-name')
keys = []
for obj in bucket.objects.all():
    keys.append(obj.key)
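If you only need part of a bucket, or want to split the work across the parallel tasks mentioned in the question, the same collection can be filtered by prefix. A small sketch (the prefix below is a made-up example):

import boto3

bucket = boto3.resource('s3').Bucket('bucket-name')

# Hypothetical prefix: list only the keys under one "folder" so that
# several prefixes can be handled by separate workers.
keys = [obj.key for obj in bucket.objects.filter(Prefix='logs/2021/')]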
Upvotes: 3