marc.fargas

Reputation: 786

google-cloud-storage python list_blobs performance

I have a very simple python function:

from google.cloud import storage

def list_blobs(bucket, project):
    storage_client = storage.Client(project=project)
    bucket = storage_client.get_bucket(bucket)
    blobs = bucket.list_blobs(prefix='basepath/', max_results=999999,
                              fields='items(name,md5Hash),nextPageToken')
    r = [(b.name, b.md5_hash) for b in blobs]
    return r

The blobs list contains 14599 items, and this code takes 7 seconds to run. When profiling, most of the time is spent reading from the server (there are 16 calls to page_iterator._next_page).

So, how can I improve here? The iteration code is deep in the library, and the token for each page comes from the previous page, so I see no way to fetch the 16 pages in parallel and cut down those 7 seconds.

Profile from SnakeViz

I am on Python 3.6.8, with:

google-api-core==1.7.0
google-auth==1.6.2
google-cloud-core==0.29.1
google-cloud-storage==1.14.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.6
protobuf==3.6.1

Upvotes: 5

Views: 3226

Answers (1)

Dan Cornilescu

Reputation: 39824

Your max_results=999999 is larger than 14599 (the number of objects), forcing all results into a single page. From Bucket.list_blobs():

Parameters:

max_results (int) – (Optional) The maximum number of blobs in each page of results from this request. Non-positive values are ignored. Defaults to a sensible value set by the API.

My guess is that the code spends a lot of time blocked waiting for the server to provide the info needed to iterate through the results.

So the first thing I'd try would be to actually iterate through multiple pages, using a max_results value smaller than the number of blobs. Maybe 1000 or 2000, and see the impact on the overall duration?
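For instance, a quick timing harness like this sketch could compare a few values. The bucket and project names are placeholders, and one caveat: depending on the library version, max_results may cap the total number of results rather than the page size, so check the count it reports:

import time
from google.cloud import storage

def time_listing(bucket_name, project, max_results):
    # list_blobs is lazy, so the actual HTTP requests happen while
    # iterating; timing the comprehension captures the full listing.
    client = storage.Client(project=project)
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix='basepath/', max_results=max_results,
                              fields='items(name,md5Hash),nextPageToken')
    start = time.perf_counter()
    items = [(b.name, b.md5_hash) for b in blobs]
    elapsed = time.perf_counter() - start
    print('max_results=%s: %s blobs in %.2fs' % (max_results, len(items), elapsed))

for n in (1000, 2000, 5000):
    time_listing('my-bucket', 'my-project', n)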

Maybe even try using the multiple pages explicitly, via blobs.pages, as suggested in the deprecated page_token property doc (emphasis mine):

page_token (str) – (Optional) If present, return the next batch of blobs, using the value, which must correspond to the nextPageToken value returned in the previous response. Deprecated: use the pages property of the returned iterator instead of manually passing the token.

But I'm not quite sure how to force multiple pages to be pulled simultaneously. Maybe something like this?

[(b.name, b.md5_hash) for page in blobs.pages for b in page]
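Since each page token only comes back with the previous response, the pages themselves can't really be requested in parallel. One workaround sketch (not something the library does for you): if the object names under basepath/ happen to be spread across known, disjoint prefixes, those prefixes can be listed concurrently. The shard characters below are purely hypothetical:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def list_shard(bucket_name, project, prefix):
    # One client per thread: sharing a client across threads is not
    # guaranteed to be safe in this version of the library.
    client = storage.Client(project=project)
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix,
                              fields='items(name,md5Hash),nextPageToken')
    return [(b.name, b.md5_hash) for b in blobs]

# Hypothetical shards: this only works if these prefixes are disjoint
# and together cover every object under 'basepath/'.
shards = ['basepath/%s' % c for c in '0123456789abcdef']
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(list_shard, 'my-bucket', 'my-project', p)
               for p in shards]
    result = [item for f in futures for item in f.result()]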

Upvotes: 1
