Reputation: 21
My file is stored in Azure Blob Storage and its name looks like 1627937153-1627937153-ab_test-20210604-0-0.parquet.gz. How can I read the data from this file in Databricks using Python without downloading the file into the Databricks environment? I have multiple files of the same format in the same folder. Can anyone help me with this?
Upvotes: 2
Views: 4334
Reputation: 10871
You may try one of the following:
1. As referred to here by @bala:
import pandas as pd
df = pd.read_parquet("myFile.parquet.gzip")
display(df)
(or) 2. From this SO reference:
import io
import pandas as pd

# blob_to_read is a file-like object (e.g. io.BytesIO) holding the blob's bytes
df = pd.read_parquet(blob_to_read, engine='pyarrow')
display(df)
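For this to work you first need the blob's bytes in memory. A minimal sketch using the azure-storage-blob SDK; the connection string, container name, and folder path below are placeholders for your own values, and note that if the file is additionally gzip-compressed (as the .gz suffix suggests) you still need to decompress it first, as in option 3 below:
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection details - replace with your own values
service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(
    container="<your-container>",
    blob="1627937153-1627937153-ab_test-20210604-0-0.parquet.gz")

# Download the blob into memory (nothing is written to local disk)
blob_to_read = io.BytesIO(blob.download_blob().readall())
df = pd.read_parquet(blob_to_read, engine='pyarrow')
display(df)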
(Or) 3. Try using the gzip module to decompress the file before parsing it:
import gzip
import io
import pandas as pd

# Decompress the outer gzip layer, then hand the raw parquet bytes to pandas
with gzip.open("filename.parquet.gz", "rb") as f:
    df = pd.read_parquet(io.BytesIO(f.read()))
display(df)
You can also refer to this article on zip-files-python, taken from the zip-files-python-notebook, which shows how to unzip files with these steps (sketched below):
1. Retrieve the file
2. Unzip the file
3. Move the file to DBFS
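A rough sketch of those three steps in a notebook cell, adapted to a .gz file; the source URL and paths are placeholders, and dbutils is the utility object that Databricks injects into notebooks:
import gzip
import shutil
import urllib.request

# 1. Retrieve the file to the driver's local disk (placeholder URL)
urllib.request.urlretrieve("https://<your-source>/LoanStats3a.parquet.gz",
                           "/tmp/LoanStats3a.parquet.gz")

# 2. Unzip the file
with gzip.open("/tmp/LoanStats3a.parquet.gz", "rb") as src, \
        open("/tmp/LoanStats3a.parquet", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 3. Move the file to DBFS so Spark can read it
dbutils.fs.mv("file:/tmp/LoanStats3a.parquet", "dbfs:/tmp/LoanStats3a.parquet")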
And finally load the file into a data frame using
df = spark.read.format("parquet").option("inferSchema", "true").option("header","true").load("dbfs:/tmp/LoanStats3a.parquet")
display(df)
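Since the question asks how to read the files without copying them first, you can also point Spark directly at the blob container. A minimal sketch, assuming you authenticate with the storage account key; the account, container, key, and folder path are placeholders, and this applies to plain parquet files (externally gzipped parquet would still need the decompression step above):
# Placeholder storage details - replace with your own values
storage_account = "<your-storage-account>"
container = "<your-container>"

# Let Spark authenticate against Blob Storage with the account key
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    "<your-account-key>")

# Read every parquet file in the folder directly from blob storage
df = spark.read.parquet(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/<folder-path>/")
display(df)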
If you use Azure Data Lake Gen2, check pyarrowfs-adlgen2, which is an implementation of a pyarrow filesystem for Azure Data Lake Gen2. See Use pyarrow with Azure Data Lake gen2. It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without needing to copy files to local storage first.
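A rough sketch of what that can look like, based on the project's documented usage; the account name, container, and folder path are placeholders, and DefaultAzureCredential assumes your environment is already set up for Azure authentication:
import azure.identity
import pyarrow.dataset
import pyarrow.fs
import pyarrowfs_adlgen2

# Wrap the ADLS Gen2 account in a pyarrow-compatible filesystem
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    "<your-account-name>", azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)

# Read the parquet dataset straight from the lake into pandas
dataset = pyarrow.dataset.dataset("<container>/<folder-path>", filesystem=fs)
df = dataset.to_table().to_pandas()
display(df)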
Upvotes: 0