Reputation: 1190
I need to create a data frame with the pandas library from parquet files hosted in a Google Cloud Storage bucket. I have searched the documentation and online examples but can't seem to figure out how to go about it.
Could you please assist me by pointing me in the right direction?
I am not looking for a solution, but for a place where I can look for further information so that I can devise my own solution.
Thank you in advance.
Upvotes: 5
Views: 9223
Reputation: 61
You can use the gcsfs and pyarrow libraries to do so.
import gcsfs
from pyarrow import parquet
url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()
# Assuming your parquet files start with a `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]
ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
Upvotes: 6
Reputation: 1168
You can read it with pandas.read_parquet like this:
import pandas
df = pandas.read_parquet('gs://bucket_name/file_name')
Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.
Don't forget to provide credentials in case you access a private bucket.
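For a private bucket, here is a minimal sketch of passing credentials, assuming a hypothetical service-account key file `key.json` and hypothetical bucket/object names; the `storage_options` dict is forwarded to gcsfs:
import pandas

# "token" points gcsfs at a service-account JSON key file (hypothetical path)
df = pandas.read_parquet(
    "gs://bucket_name/file_name",
    storage_options={"token": "key.json"},
)
Alternatively, gcsfs can typically pick up application default credentials, e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable.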
Upvotes: 5