datahappy

Reputation: 856

Importing multiple files from Google Cloud Bucket to Datalab instance

I have a bucket set up on Google Cloud containing a few hundred JSON files, and I'm trying to work with them in a Datalab instance running Python 3.

So, I can easily see them as objects using

%gcs list --objects gs://<BUCKET_NAME>

Further, I can read in an individual file/object using

 import google.datalab.storage as storage
 import pandas as pd
 from io import BytesIO

 myBucket = storage.Bucket('<BUCKET_NAME>')
 data_csv = myBucket.object('<FILE_NAME.json>')

 uri = data_csv.uri
 %gcs read --object $uri --variable data

 df = pd.read_csv(BytesIO(data))
 df.head()

(FYI, I understand that my example is reading a JSON as a CSV, but let's ignore that; I'll cross that bridge on my own.)

What I can't figure out is how to loop through the bucket and pull all of the JSON files into pandas. How do I do that? Is that even the right way to think about this, or is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
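For illustration, what I'm imagining is something like this (pure pseudocode, the angle-bracket parts are just placeholders for whatever the real calls are):

 all_dfs = []
 for obj in <every object in myBucket>:
     all_dfs.append(pd.read_csv(<contents of obj>))  # or read_json, once I sort that out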

As an extra bit: what if a file is saved as a .json but isn't actually valid JSON? How can I handle that?

Essentially, I guess, I'm looking for something like the glob module's functionality, but with Cloud Storage buckets + Datalab.

Any help is greatly appreciated.

Upvotes: 0

Views: 2181

Answers (1)

Guillem Xercavins

Reputation: 7058

This can be done using Bucket.objects(), which returns an iterator over all matching files. Specify a prefix, or leave it empty to match all files in the bucket. I did an example with two files, countries1.csv and countries2.csv:

$ cat countries1.csv
id,country
1,sweden
2,spain

$ cat countries2.csv
id,country
3,italy
4,france

And used the following Datalab snippet:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')

df_list = []

for obj in object_list:
  uri = obj.uri
  %gcs read --object $uri --variable data
  df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

which will output the combined DataFrame:

    id  country
0   1   sweden
1   2   spain
2   3   italy
3   4   france

Take into account that I combined all CSV files into a single Pandas DataFrame with this approach, but you might want to load them into separate ones depending on the use case. If you want to retrieve all files in the bucket, just use this instead:

object_list = myBucket.objects()
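
And since your files are actually JSON (and some of them might not contain valid JSON at all), here is a minimal sketch of the same loop using pd.read_json with a try/except around the parse, keeping one DataFrame per file. I'm assuming obj.key gives the object name and that your JSON is in a shape pd.read_json understands, so adapt it to your actual data:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')

df_by_name = {}  # one DataFrame per object, keyed by object name
bad_files = []   # objects named .json that fail to parse

for obj in myBucket.objects():
  uri = obj.uri
  %gcs read --object $uri --variable data
  try:
    df_by_name[obj.key] = pd.read_json(BytesIO(data))
  except ValueError:
    bad_files.append(obj.key)

print('Loaded:', list(df_by_name.keys()))
print('Skipped (not valid JSON):', bad_files)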

Upvotes: 2
