Reputation: 1190
I need to create a data frame with the pandas library from parquet files hosted in a Google Cloud Storage bucket. I have searched the documentation and online examples but can't seem to figure out how to go about it.
Could you please assist me by pointing me in the right direction?
I am not looking for a solution, but for a place where I can look for further information so that I can devise my own solution.
Thank you in advance.
Upvotes: 5
Views: 9223
Reputation: 61
You can use the gcsfs and pyarrow libraries to do so.
import gcsfs
from pyarrow import parquet
url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()
# Assuming your parquet files start with a `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]
ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
Upvotes: 6
Reputation: 1168
You can read it with pandas.read_parquet like this:
import pandas
df = pandas.read_parquet('gs://bucket_name/file_name')
Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.
Don't forget to provide credentials in case you access a private bucket.
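For a private bucket, here is a minimal sketch of passing credentials, assuming a hypothetical service-account key file `key.json` and hypothetical bucket/object names; the `storage_options` dict is forwarded to gcsfs:
import pandas

# "token" points gcsfs at a service-account JSON key file (hypothetical path)
df = pandas.read_parquet(
    "gs://bucket_name/file_name",
    storage_options={"token": "key.json"},
)
Alternatively, gcsfs can typically pick up application default credentials, e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable.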
Upvotes: 5