Reputation: 30015
I am trying to read an Excel file (.xlsx) from Azure Databricks; the file is in ADLS Gen 2.
Example:
srcPathforParquet = "wasbs://[email protected]//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://[email protected]//1_Raw//src.xlsx"
Reading the parquet file from the path works fine:
srcparquetDF = spark.read.parquet(srcPathforParquet)
Reading the Excel file from the path throws the error: No such file or directory
srcexcelDF = pd.read_excel(srcPathforExcel, keep_default_na=False, na_values=[''])
Upvotes: 6
Views: 42939
Reputation: 502
From my experience, the following are the basic steps that worked for me to read an Excel file from ADLS Gen2 in Databricks:
Install the library com.crealytics:spark-excel_2.12:0.13.6 on the cluster.
Set the storage account key in the Spark configuration:
spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)
adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> access key of your ADLS account
myDataFrame = (spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheetname'!")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "false")
    .option("addColorColumns", "false")
    .option("startColumn", 0)
    .option("endColumn", 99)
    .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
    .load(FullFilePathExcel)
)
Upvotes: 0
Reputation: 23111
The method pandas.read_excel does not support wasbs or abfss scheme URLs for accessing the file. For more details, please refer to here.
So if you want to access the file with pandas, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Meanwhile, you can also mount the storage account as a filesystem and then access the file as @CHEEKATLAPRADEEP-MSFT said.
For example
pdf=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)
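The URL handed to pandas here is just the DFS file URL with the SAS token appended as a query string. A minimal, stdlib-only sketch of assembling it (the helper name and sample token are mine, not from the answer):

```python
def with_sas(file_url: str, sas_token: str) -> str:
    """Append a SAS token (with or without a leading '?') to a file URL."""
    return file_url + "?" + sas_token.lstrip("?")


url = with_sas(
    "https://<account name>.dfs.core.windows.net/<file system>/<path>",
    "sv=2020-08-04&sig=...",  # hypothetical token fragment
)
# pdf = pd.read_excel(url)  # works because pandas understands the https scheme
```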
Install the packages azure-storage-file-datalake and xlrd with pip in Databricks.
Code
import io
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')
file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
    downloader = file_client.download_file()
    b = downloader.readinto(f)  # number of bytes written into the buffer
    print(b)
    f.seek(0)  # rewind the buffer before handing it to pandas
    df = pd.read_excel(f)
    print(df)
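One detail worth flagging in the snippet above: after the download writes into the buffer, the stream position sits at the end of the data, so pandas would see an empty stream unless you rewind first. A stdlib-only illustration of why the rewind matters:

```python
import io

f = io.BytesIO()
f.write(b"fake xlsx bytes")   # stands in for downloader.readinto(f)
print(f.read())               # b'' -- the position is already at the end
f.seek(0)                     # rewind before handing the buffer to pd.read_excel
print(f.read())               # b'fake xlsx bytes'
```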
Besides, we can also use pyspark to read the Excel file. But we need to add the jar com.crealytics:spark-excel to our environment. For more details, please refer to here and here.
For example
Add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Besides, please note that if you use Scala 2.11, add the package com.crealytics:spark-excel_2.11:0.13.1 instead.
Code
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net",'<account key>')
print("use spark")
df=sqlContext.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.load('abfss://[email protected]/data/sample.xlsx')
df.show()
Upvotes: 5
Reputation: 12768
As per my repro, an Excel file in ADLS Gen2 cannot be accessed directly using the storage account access key. When I tried reading the Excel file via the ADLS Gen2 URL, I got the same error message: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://[email protected]/flightdata/drivers.xlsx'.
Steps to read an Excel file (.xlsx) from Azure Databricks when the file is in ADLS Gen 2:
Step1: Mount the ADLS Gen2 storage account.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
Step2: Read the Excel file using the mount path.
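Once mounted, the files are also exposed through the local FUSE mount at /dbfs, so local-file libraries such as pandas can open them directly. A sketch assuming a mount named as in Step 1 (the helper function is mine, purely to show the path translation):

```python
def dbfs_local_path(mount_path: str) -> str:
    """Translate a Databricks mount path like /mnt/<mount-name>/... into the
    local FUSE path that non-Spark libraries (e.g. pandas) can open."""
    return "/dbfs" + mount_path


excel_path = dbfs_local_path("/mnt/<mount-name>/1_Raw/src.xlsx")
# df = pd.read_excel(excel_path, keep_default_na=False, na_values=[''])
# Spark can instead read the mount path directly with the spark-excel reader:
# spark.read.format("com.crealytics.spark.excel").option("header", "true").load("/mnt/<mount-name>/1_Raw/src.xlsx")
```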
Reference: Azure Databricks - Azure Data Lake Storage Gen2
Upvotes: 4