Keto

Reputation: 2215

How to get the next blob in Python Google Cloud Storage library method list_blobs

It seems to me that google.cloud.storage.Client::list_blobs returns an HTTPIterator, which is not a proper Python iterator. See below:

import google.cloud.storage as gcs

client = gcs.Client()

blobs = client.list_blobs("mybucket")
blob = next(blobs)  # TypeError: 'HTTPIterator' object is not an iterator

blob = blobs.__next__()  # AttributeError: 'HTTPIterator' object has no attribute '__next__'

I'm looking for a solution that does not iterate through the entire iterator. The only solution I can come up with is a silly hack: a for loop that breaks after the first iteration.

Upvotes: 3

Views: 4352

Answers (1)

John Hanley

Reputation: 81414

Without understanding the details of a Page Iterator, you can simply convert the iterator to a list:

blobs = client.list_blobs(bucketName)
blob_list = list(blobs)

# First blob
blob_list[0].name

# Second blob
blob_list[1].name

# Of course you can check the number of list items with len()
count = len(blob_list)

In practice, it is important to understand that list_blobs() does not fetch everything at once. Typically, the library fetches up to 1,000 objects per request. This is called paging. For a bucket with 1,500 objects, iterating fetches two pages (1,000 objects, then 500). However, a page may contain fewer than 1,000 objects.

blobs = client.list_blobs(bucketName)
for page in blobs.pages:
    print('Page number: ', blobs.page_number)
    print('Count:       ', page.num_items)

Output:

Page number:  1
Count:        1000
Page number:  2
Count:        500
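
To make the lazy behaviour concrete, here is a small simulation of paging that runs without any cloud credentials. It is only a sketch: iter_pages is a hypothetical helper, not the library's code, and the page size of 1,000 is simply the figure quoted above. With the real iterator, next(blobs.pages) should similarly pull just one page, though I have not verified that against every library version.

```python
from typing import Iterator, List


def iter_pages(total_items: int, page_size: int = 1000) -> Iterator[List[int]]:
    """Yield pages lazily; each page is built only when it is requested."""
    for start in range(0, total_items, page_size):
        yield list(range(start, min(start + page_size, total_items)))


pages = iter_pages(1500)

# Only the first page is produced at this point.
first_page = next(pages)
print(len(first_page))

# Draining the generator produces the remaining page(s).
remaining_sizes = [len(page) for page in pages]
print(remaining_sizes)
```

Stopping after the first page is what avoids the extra requests; list() defeats this because it drains every page.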

When you convert a Page Iterator to a list, all of the objects are fetched. For large buckets, this can take a substantial amount of time just to read the first one or two objects.
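
If only the first blob is needed, one option is to wrap the result in iter() before calling next(). The object returned by list_blobs works in a for loop, so it is iterable even when it does not define __next__ itself. A minimal sketch follows; first_item and FakeBlobListing are hypothetical names, and the fake class is only a stand-in so the snippet runs without credentials.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def first_item(iterable: Iterable[T], default: Optional[T] = None) -> Optional[T]:
    """Return the first element of any iterable, or `default` if it is empty."""
    return next(iter(iterable), default)


class FakeBlobListing:
    """Hypothetical stand-in: iterable (has __iter__) but not an iterator."""

    def __init__(self, names):
        self._names = names
        self.fetched = 0  # counts how many items were actually produced

    def __iter__(self) -> Iterator[str]:
        for name in self._names:
            self.fetched += 1
            yield name


listing = FakeBlobListing(["a.txt", "b.txt", "c.txt"])
first = first_item(listing)
print(first, listing.fetched)

# With the real client (assumes credentials and an existing bucket):
# blobs = client.list_blobs("mybucket", max_results=1)
# blob = first_item(blobs)
```

Passing max_results=1 to list_blobs, as in the commented lines, should also keep the request itself small, and the default argument avoids a StopIteration when the bucket is empty.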

For a better understanding, study the source code for the Page Iterator.

Page Iterators

Upvotes: 3
