cpaulik

Reputation: 393

How to get list_blobs to behave like gsutil

I would like to only get the first level of a fake folder structure on GCS.

If I run e.g.:

gsutil ls 'gs://gcp-public-data-sentinel-2/tiles/'

I get a list like this:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
gs://gcp-public-data-sentinel-2/tiles/11/
gs://gcp-public-data-sentinel-2/tiles/12/
gs://gcp-public-data-sentinel-2/tiles/13/
gs://gcp-public-data-sentinel-2/tiles/14/
gs://gcp-public-data-sentinel-2/tiles/15/
.
.
.

Running code like the following with the Python API gives me an empty result:

from google.cloud import storage
bucket_name = 'gcp-public-data-sentinel-2'
prefix = 'tiles/'
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
for blob in bucket.list_blobs(max_results=10, prefix=prefix,
                              delimiter='/'):
    print(blob.name)

If I don't use the delimiter option, I get all the results in the bucket, which is not very useful.

Upvotes: 7

Views: 16691

Answers (4)

Antoine Neidecker

Reputation: 841

Here is a faster way (found this in this GitHub thread, posted by @evanj):

def list_gcs_directories(bucket, prefix):
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes

You want to call this function as follows:

from google.cloud import storage

client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj,
    prefix='my/prefix/path/within/bucket/')

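# Getting rid of the prefix: keep only the last path segment.
# Each returned prefix ends with '/', so strip it before splitting;
# otherwise the last segment is always an empty string.

list_folders = [indiv_folder.rstrip('/').split('/')[-1]
                for indiv_folder in list_folders]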
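
Since the function above returns a set, the folder names come back in arbitrary order, unlike the sorted output of gsutil. A minimal sketch of a helper that produces sorted leaf names; `leaf_folder_names` is a hypothetical name for illustration and is not part of google-cloud-storage:

```python
def leaf_folder_names(prefixes, delimiter='/'):
    """Return the last path segment of each prefix, sorted.

    `prefixes` is any iterable of strings such as 'tiles/01/'.
    """
    names = []
    for prefix in prefixes:
        # Drop the trailing delimiter first; otherwise the last
        # segment after splitting would be an empty string.
        trimmed = prefix.rstrip(delimiter)
        names.append(trimmed.split(delimiter)[-1])
    return sorted(names)


print(leaf_folder_names({'tiles/02/', 'tiles/01/', 'tiles/10/'}))
# prints ['01', '02', '10']
```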

Upvotes: 2

jeffry copps

Reputation: 305

Here is the right answer that works

To get a simple, one-level listing of a "directory" (really just a shared name prefix) in a Google Cloud Storage bucket:

Sample Link: 'gs://BUCKET_A/FOLDER_1/FOLDER_2/FILE_10.txt'

Function to be used: list_blobs.

Parameters to pass to list_blobs:

  1. bucket_name - Name of the storage bucket. Example: "BUCKET_A"
  2. prefix - Only objects whose names start with this string are listed. Example: "FOLDER_1/FOLDER_2/"
  3. delimiter - The listing does not descend past this character. For a simple listing, the delimiter has to be '/': any object whose name contains another '/' after the prefix belongs to a deeper level, so it is not returned as a blob and is reported under prefixes instead.

Sample code

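from google.cloud import storage

# Example values matching the sample link above.
bucket_name = "BUCKET_A"
prefix = "FOLDER_1/FOLDER_2/"
delimiter = "/"

storage_client = storage.Client()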

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)

# Note: The call returns a response only when the iterator is consumed.
print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

To achieve what we need:

  1. Pass the prefix with a trailing slash "/".
  2. Pass the delimiter as "/" so the listing does not go beyond the current directory.
  3. Process the results in two forms. Say blobs is the return value of list_blobs. Iterating over blobs yields the files available at that level. If you want the subdirectories at that level, iterate over blobs.prefixes, which is only filled in once blobs has been iterated.

In summary:

Access the files by iterating over blobs. Access the sub-folders by iterating over blobs.prefixes.
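
To make the blobs/prefixes split concrete, here is a pure-Python sketch that imitates a delimited listing over a plain list of object names. It touches no GCS API; `emulate_listing` and the extra object names are made up for illustration. It also shows why iterating the blobs can print nothing when every object sits at a deeper level, as in the question.

```python
def emulate_listing(object_names, prefix='', delimiter='/'):
    """Split names into (blobs, prefixes) the way a delimited listing does."""
    blobs = []
    prefixes = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # The name continues past the delimiter, so it is collapsed
            # into a single "sub-folder" prefix instead of being listed.
            head = rest.split(delimiter, 1)[0]
            prefixes.add(prefix + head + delimiter)
        else:
            blobs.append(name)
    return blobs, sorted(prefixes)


names = [
    'FOLDER_1/FILE_1.txt',
    'FOLDER_1/FOLDER_2/FILE_10.txt',
    'FOLDER_1/FOLDER_2/FOLDER_3/FILE_20.txt',
]
blobs, prefixes = emulate_listing(names, prefix='FOLDER_1/FOLDER_2/')
print(blobs)     # ['FOLDER_1/FOLDER_2/FILE_10.txt']
print(prefixes)  # ['FOLDER_1/FOLDER_2/FOLDER_3/']
```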

Upvotes: 0

Mischa Lisovyi

Reputation: 3223

If, like me, you find this question long after it was asked: currently (google-cloud-storage 2.1.0) you can list the bucket contents by using '//' instead of '/'. However, this lists "recursively" down to the actual blobs, since GCS is not a real file system.

Upvotes: 0

Mangu

Reputation: 3325

Maybe not the best way, but, inspired by this comment on the official repository:

iterator = bucket.list_blobs(delimiter='/', prefix=prefix)
response = iterator._get_next_page_response()
for prefix in response['prefixes']:
    print('gs://'+bucket_name+'/'+prefix)

Gives:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
...
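
Note that `_get_next_page_response()` is a private method and returns only one page of results. A hedged alternative that sticks to the public `pages` iterator might look like the sketch below; `first_level_uris` is a hypothetical helper name, and it assumes `bucket` behaves like a google.cloud.storage Bucket (a `name` attribute and a `list_blobs` method).

```python
def first_level_uris(bucket, prefix, delimiter='/'):
    """Return gs:// URIs for the sub-"folders" directly under `prefix`.

    `bucket` is assumed to behave like google.cloud.storage.Bucket.
    """
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    found = set()
    # Each page carries the prefixes seen in that page of results,
    # so walking all pages collects every first-level prefix.
    for page in iterator.pages:
        found.update(page.prefixes)
    return ['gs://{}/{}'.format(bucket.name, p) for p in sorted(found)]
```

With a real client this could be called as first_level_uris(storage.Client().bucket(bucket_name), 'tiles/'), given working credentials.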

Upvotes: 7
