Reputation: 856
I have a bucket set up on Google Cloud containing a few hundred JSON files, and I am trying to work with them in a Datalab instance running Python 3.
So, I can easily see them as objects using
%gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME.json>')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example reads a JSON file as a CSV, but let's ignore that. I'll cross that bridge on my own.)
What I can't figure out is how to loop through the bucket and pull all of the JSON files into pandas. How do I do that? Is this even the right way to think about it, or is there a way to read the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit: what if a file is saved as .json but doesn't actually have that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the glob package, but using cloud buckets + Datalab.
Any help is greatly appreciated.
Upvotes: 0
Views: 2181
Reputation: 7058
This can be done using Bucket.objects, which returns an iterator over all matching objects. Specify a prefix, or leave it empty to match all files in the bucket. I made an example with two files, countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')
df_list = []
for obj in object_list:
    # Read each object's contents into the Python variable `data`
    %gcs read --object $obj.uri --variable data
    df_list.append(pd.read_csv(BytesIO(data)))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which outputs the combined CSV:
   id country
0   1  sweden
1   2   spain
2   3   italy
3   4  france
Note that this approach combines all the CSV files into a single pandas DataFrame; depending on your use case, you might want to load them into separate ones. To retrieve all files in the bucket, just use this instead:
object_list = myBucket.objects()
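As for the files that are saved as JSON but don't actually contain valid JSON, one option is to try parsing each payload and set aside the ones that fail. The following is only a minimal sketch using the standard json module. The helper name parse_json_payloads and the sample payloads are made up for illustration; in Datalab, the (name, bytes) pairs are assumed to come from the same %gcs read loop as above.

```python
import json


def parse_json_payloads(payloads):
    """Parse (name, raw_bytes) pairs.

    Returns a list of (name, parsed_value) for valid JSON and a list of
    names whose content could not be decoded or parsed.
    """
    records, failed = [], []
    for name, raw in payloads:
        try:
            text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
            parsed = json.loads(text)
        except ValueError:
            # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError
            failed.append(name)
            continue
        records.append((name, parsed))
    return records, failed


# Hypothetical payloads standing in for the bytes read from the bucket
sample_payloads = [
    ("good.json", b'{"id": 1, "country": "sweden"}'),
    ("not_really.json", b"id,country\n1,sweden\n"),
]
records, failed = parse_json_payloads(sample_payloads)
print("parsed:", [name for name, _ in records])
print("skipped:", failed)
```

The parsed values can then be passed to pandas (for example, building a DataFrame from a list of dicts), and the skipped names can be inspected separately or retried with pd.read_csv.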
Upvotes: 2