Ashish Dhamu

Reputation: 71

How to apply regex for Google Cloud Storage Bucket using Python

I am fetching objects from Google Cloud Storage using Python. The folder contains many files (around 20,000).

But I only need one particular file, which is a .json file; all the other files are in CSV format. For now I am using the following code with the prefix option:

from google.cloud import storage
import json
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

blobs = list(bucket.list_blobs(prefix="input"))

for blob in blobs:
    if '.json' in blob.name:
        filename = blob.name
        break

This approach does not scale: the file count is going to increase, so filtering for the JSON file will take longer and longer. (The file name is dynamic and could be anything.)

Is there an option, such as a regex filter, that can be applied while fetching the data from Cloud Storage?

Upvotes: 3

Views: 5424

Answers (2)

Connor Ross

Reputation: 355

This may be a new feature in the Python SDK, but you can pass a glob pattern to the list_blobs function.

The main point is to pass a match_glob parameter to list_blobs, using standard file glob syntax:

from google.cloud import storage
import json
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

blobs = list(bucket.list_blobs(prefix="input", match_glob="**.json"))
filenames = [blob.name for blob in blobs]

The ** matches zero or more characters, including folder slashes.
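Since list_blobs returns a lazy, paged iterator, wrapping it in list() forces every page to be fetched. If only one .json object is needed, something like the sketch below could stop at the first hit instead. Note that first_json_name is a hypothetical helper and fake_blobs are stand-in objects with a .name attribute, not real blobs:

```python
from types import SimpleNamespace


def first_json_name(blobs):
    """Return the name of the first blob ending in .json, or None."""
    # next() pulls items one at a time, so iteration stops at the first hit
    return next(
        (blob.name for blob in blobs if blob.name.endswith(".json")),
        None,
    )


# Stand-in objects (assumption: only the .name attribute is needed)
fake_blobs = [
    SimpleNamespace(name=n)
    for n in ("input/a.csv", "input/config.json", "input/b.csv")
]

print(first_json_name(fake_blobs))  # input/config.json
```

With a real bucket, the same helper should work as first_json_name(bucket.list_blobs(prefix="input", match_glob="**.json")).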

Upvotes: 0

Iñigo González

Reputation: 3955

If you want to check filename/extension against a regex it's quite easy.

Just import the 're' module at the beginning

import re

And check against a regex inside the loop:

for blob in blobs:
    # re.search scans the whole name; re.match would only test the start
    if re.search(r'\.json$', blob.name):
        filename = blob.name
        break

You can develop the regex at regex101.com before you hardcode it in your code.
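One thing to watch with suffix patterns: re.match only matches at the start of the string, while re.search looks anywhere in it. A quick check, using made-up object names:

```python
import re

pattern = r"\.json$"
name = "input/config.json"  # hypothetical object name

# re.match anchors at the beginning, so a suffix-only pattern never matches
print(re.match(pattern, name))  # None

# re.search scans the whole string and finds the extension at the end
print(bool(re.search(pattern, name)))  # True

# The $ anchor keeps names like this one out
print(bool(re.search(pattern, "input/compressed.json.gz")))  # False
```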

BTW - I prefer to check extensions with str.endswith which is quite fast:

for blob in blobs:
    if blob.name.endswith('.json'):
        filename = blob.name
        break

I wouldn't use

if '.json' in filename:
   etc...

because it would also match other filenames such as 'compressed.json.gz'
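To see the difference between the two checks, here is a small comparison on a few made-up names (the names list is only an illustration):

```python
# Hypothetical object names, for illustration only
names = ["input/data.csv", "input/compressed.json.gz", "input/config.json"]

# Substring test: also picks up the gzipped file
substring_hits = [n for n in names if ".json" in n]

# Suffix test: only names that really end in .json
suffix_hits = [n for n in names if n.endswith(".json")]

print(substring_hits)  # ['input/compressed.json.gz', 'input/config.json']
print(suffix_hits)     # ['input/config.json']
```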

Upvotes: 2
