Reputation: 71
I am fetching objects from Google Cloud Storage using Python. The folder contains many files (around 20,000).
But I only need one particular file, which is a .json file; all the other files are in CSV format. For now I am using the following code with the prefix option:
from google.cloud import storage
import json
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="input"))
for blob in blobs:
    if '.json' in blob.name:
        filename = blob.name
        break
This approach will not scale: the file count is going to increase, and filtering for the JSON file will take more and more time. (The file name is dynamic and could be anything.)
Is there an option, such as a regex filter, that can be applied while fetching the data from Cloud Storage?
Upvotes: 3
Views: 5424
Reputation: 355
This may be a new feature in the Python SDK, but you can pass a match glob to the list_blobs function.
See instructions here. The main point is to pass a match_glob parameter to list_blobs, using standard file glob syntax.
from google.cloud import storage
import json
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix="input", match_glob="**.json"))
filenames = [blob.name for blob in blobs]
The ** matches zero or more characters, including folder slashes.
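Since only one file is needed, a possible refinement is to stop at the first match instead of building a full list. The sketch below assumes the installed google-cloud-storage release supports match_glob; first_json_name is a hypothetical helper, and it takes an already-created bucket object rather than creating the client itself.

```python
def first_json_name(bucket, prefix="input"):
    """Return the name of the first .json object under prefix, or None.

    Hypothetical helper: `bucket` is expected to be a google.cloud.storage
    Bucket, or any object with a compatible list_blobs method.
    """
    # list_blobs returns an iterator, so next() consumes only the first
    # result instead of collecting every matching blob with list().
    blobs = bucket.list_blobs(prefix=prefix, match_glob="**.json")
    first = next(iter(blobs), None)
    return first.name if first is not None else None
```

It could be called as filename = first_json_name(storage_client.get_bucket(bucket_name)); a None result means no .json object was found under the prefix.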
Upvotes: 0
Reputation: 3955
If you want to check filename/extension against a regex it's quite easy.
Just import the 're' module at the beginning
import re
And check against a regex inside the loop:
for blob in blobs:
    # re.search, not re.match: re.match only matches at the start of the name
    if re.search(r'\.json$', blob.name):
        filename = blob.name
        break
You can develop the regex at regex101.com before you hard-code it.
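One detail worth checking when testing the pattern: re.match only tries the pattern at the start of the string, so an end-anchored pattern such as \.json$ needs re.search. A small sketch with made-up object names (the names are illustrative assumptions, not from a real bucket):

```python
import re

# Pattern anchored at the end of the name; the dot is escaped so it is literal.
JSON_SUFFIX = re.compile(r"\.json$")

names = ["input/data.csv", "input/config.json", "input/compressed.json.gz"]

# re.match only tries the pattern at position 0, so none of these names match.
matched = [n for n in names if JSON_SUFFIX.match(n)]

# re.search scans the whole string and finds the trailing ".json".
searched = [n for n in names if JSON_SUFFIX.search(n)]

print(matched)   # []
print(searched)  # ['input/config.json']
```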
BTW - I prefer to check extensions with str.endswith which is quite fast:
for blob in blobs:
    if blob.name.endswith('.json'):
        filename = blob.name
        break
I wouldn't use
    if '.json' in filename:
because it would also match other filenames, such as 'compressed.json.gz'.
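To make the difference concrete, here is a short comparison of the substring test and the suffix test on a few made-up names (illustrative assumptions only):

```python
names = ["input/report.csv", "input/compressed.json.gz", "input/config.json"]

# Substring test: also picks up the gzipped file.
substring_hits = [n for n in names if ".json" in n]

# Suffix test: only names that actually end in ".json".
suffix_hits = [n for n in names if n.endswith(".json")]

print(substring_hits)  # ['input/compressed.json.gz', 'input/config.json']
print(suffix_hits)     # ['input/config.json']
```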
Upvotes: 2