Aman Patel AIITycJhtw

Reputation: 21

How to read a parquet file compressed with .gz in Databricks?

My file is stored in Azure Blob Storage and is named 1627937153-1627937153-ab_test-20210604-0-0.parquet.gz. How can I read the data from this file in Databricks using Python, without downloading the file into the Databricks environment? I have multiple files of the same format in the same folder. Can anyone help me with this?

Upvotes: 2

Views: 4334

Answers (1)

kavya Saraboju

Reputation: 10871

You may try

import pandas as pd

# pandas can read a gzip-compressed parquet file directly
df = pd.read_parquet("myFile.parquet.gzip")
display(df)

as referred to here by @bala, (or)

2. From this SO reference:

import io
import pandas as pd

# blob_to_read should be a file-like object (e.g. io.BytesIO) holding the blob's contents
df = pd.read_parquet(blob_to_read, engine='pyarrow')
display(df)

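The snippet above assumes blob_to_read already holds the blob's contents as a file-like object. One way to build it with the azure-storage-blob SDK, decompressing the .gz wrapper in memory, might be the following sketch (the connection string, container, and blob name are placeholders):

import gzip
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholders: supply your own connection string, container, and blob name
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(
    container="<container>",
    blob="1627937153-1627937153-ab_test-20210604-0-0.parquet.gz",
)

# Download the blob into memory and strip the gzip wrapper before parsing
raw = blob.download_blob().readall()
blob_to_read = io.BytesIO(gzip.decompress(raw))

df = pd.read_parquet(blob_to_read, engine="pyarrow")
display(df)
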
(Or) 3.

Try using the gzip module to read the gzip-compressed file:

import gzip
import io
import pandas as pd

# Decompress the gzip wrapper, then parse the parquet bytes with pandas
with gzip.open("filename.parquet.gz", "rb") as f:
    df = pd.read_parquet(io.BytesIO(f.read()))
display(df)

You can also refer to this article on zip-files-python, taken from zip-files-python-notebook, which shows how to unzip files with these steps (a rough sketch follows the list):

1. Retrieve the file

2. Unzip the file

3. Move the file to DBFS

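For a gzip-compressed parquet file like the one in the question, a minimal sketch of those steps might look like this (the local paths and DBFS destination are assumptions; dbutils is the utility Databricks provides in notebooks):

import gzip
import shutil

# Steps 1-2: with the .gz file on the driver's local disk, decompress it (hypothetical paths)
with gzip.open("/tmp/ab_test.parquet.gz", "rb") as f_in, open("/tmp/ab_test.parquet", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Step 3: move the decompressed parquet file to DBFS
dbutils.fs.mv("file:/tmp/ab_test.parquet", "dbfs:/tmp/ab_test.parquet")
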
And finally load the file into a data frame using

df = spark.read.format("parquet").option("inferSchema", "true").option("header","true").load("dbfs:/tmp/LoanStats3a.parquet")
display(df)

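Since the question mentions multiple files of the same format in one folder, note that once all the decompressed parquet files sit in a single DBFS directory, Spark can load them in one call; the directory path below is an assumption:

# Read every parquet file in the folder at once (hypothetical DBFS directory)
df = spark.read.parquet("dbfs:/tmp/extracted/")
display(df)
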
If you use Azure Data Lake Gen2, check pyarrowfs-adlgen2, an implementation of a pyarrow filesystem for Azure Data Lake Gen2 (see Use pyarrow with Azure Data Lake gen2). It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without copying files to local storage first.

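As a rough sketch of that approach, based on the project's documented usage (the storage account name and dataset path are placeholders, and authentication is assumed to go through azure-identity):

import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

# Wrap the ADLS Gen2 account as a pyarrow filesystem (account name is a placeholder)
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    "<storage-account>", azure.identity.DefaultAzureCredential()
)
fs = pyarrow.fs.PyFileSystem(handler)

# Read a parquet dataset straight from the lake, no local copy needed
table = pyarrow.dataset.dataset("container/path/to/dataset", filesystem=fs).to_table()
df = table.to_pandas()
display(df)
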
Upvotes: 0
