Loki

Reputation: 33

Read and Write csv and other file formats to Google Cloud Storage with Pandas

def get_config_files(self):
    dict_path = 'word.pkl'
    self.kw_ns = ConfigParser()
    self.kw_ns.add_section('Paths')
    self.kw_ns.set('Paths', 'new_df1', 'gs://' + filepath, encoding='utf-8')
    self.kw_ns.set('Paths', 'dictionary', 'gs://' + dict_path)
    new_df1 = pd.read_csv(self.kw_ns.get('Paths', 'new_df1'))
    dict = pickle.load(open(self.abs_path + self.kw_ns.get('Paths', 'dictionary'), 'rb'))

I can read neither the CSV nor the pickle file; both throw a file-not-found error. I have pandas 0.25, and gcsfs is installed and imported. Any pointers on how this can be accomplished?

Upvotes: 0

Views: 4121

Answers (1)

Jerry101

Reputation: 13357

With gcsfs, you need to do a bit of setup: in particular, open a file-like object which you can then read or write. Please see the documentation.

import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my-bucket/my-file.txt', 'rb') as f:
    print(f.read())
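That same file-like pattern carries over to pandas: `read_csv` accepts any open file object, so the handle from `fs.open` can be passed straight in. A minimal sketch (the bucket and file names are hypothetical; the GCS call is left commented because it needs credentials, and a local in-memory stand-in shows the pattern):

```python
import io
import pandas as pd

# With gcsfs and credentials in place (hypothetical bucket/file names):
#
#   fs = gcsfs.GCSFileSystem(project='my-google-project')
#   with fs.open('my-bucket/new_df1.csv', 'rb') as f:
#       df = pd.read_csv(f)

# Same file-like pattern with a local stand-in, runnable without GCS:
buf = io.StringIO("col_a,col_b\n1,x\n2,y\n")
df = pd.read_csv(buf)
print(df.shape)  # (2, 2)
```

With gcsfs installed, recent pandas versions also accept `gs://bucket/file.csv` URLs in `read_csv` directly, which is the shortest route once authentication works.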

Also beware that you may need to authenticate to access the desired project and its storage bucket. And if your program is running in Google Compute Engine (GCE), the GCE VM will need the storage-rw Scope (or another Scope that implies storage-rw) and the Service Account will need the Storage Object Admin Permission.

The more typical ways for a Python program to access Google Cloud Storage (GCS) are:

  1. Install the GCS Python client library and make calls to that library's API, e.g. to upload a file to a GCS blob (aka object; the closest thing it has to a file). Again, you'll need the right Scopes and Permissions. It does not implement gs:// pathnames.
  2. Shell out to a gsutil command line invocation to copy a local file to or from GCS. In this case you provide gs:// pathnames. (In Python 3 I'd use the subprocess built-in library to shell out. In Python 2 I'd use the subprocess32 library installed from PyPI, which is a back-ported version of the same library, with bug fixes.)
  3. Install gcsfuse, run it to mount a GCS bucket (optionally narrowed to a specific "subdirectory") to a local directory. Then read/write files in that local directory.
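Option 2 can be sketched like this (file and bucket names are hypothetical; the actual copy is left commented out because it requires the gsutil CLI to be installed and authenticated):

```python
import subprocess

# Hypothetical local file and destination bucket path.
src = "report.csv"
dst = "gs://my-bucket/reports/report.csv"
cmd = ["gsutil", "cp", src, dst]

# Uncomment to perform the copy; check=True raises if gsutil fails:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

Passing the command as a list (rather than a single shell string) avoids shell-quoting issues with unusual file names.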

GCS is really a flat object store, not a file system. E.g. it does not support multiple simultaneous readers and writers to a file; just atomic read or write of a blob.

GCS does not actually have directories, just paths that contain slash characters. With gcsfuse you can mount the bucket with --implicit-dirs, in which case it fakes the directories (and runs very slowly), or else you have to have "directory placeholders" (0-length objects with names ending in /). Without --implicit-dirs it will create the placeholders during certain operations but won't even see "subdirectories" that don't have them.
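A mount with implicit directories might look like this (the bucket name and mount point are hypothetical, and the mount commands are commented out since they require gcsfuse installed and credentials configured):

```shell
BUCKET=my-bucket
MNT=/mnt/gcs

# gcsfuse --implicit-dirs "$BUCKET" "$MNT"   # infer directories (slower)
# gcsfuse "$BUCKET" "$MNT"                   # rely on directory placeholders

echo "would mount gs://$BUCKET at $MNT"
```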

Please read the gcsfuse documentation on how its semantics differ from a file system even while gcsfuse does its best to bridge the gap.

Upvotes: 3
