Reputation: 341
I want to open Excel files in Azure Databricks that reside in ADLS Gen2 with this code:
#%pip install openpyxl pandas
import pandas as pd
display(dbutils.fs.ls("/mnt/myMnt"))
path = "/mnt/myMnt/20241007112914_Statistik_789760_0000_327086871111430.xlsx"
df_xl = pd.read_excel(path, engine="openpyxl")
The third line returns a list of my Excel files in ADLS, as expected, so the files exist and are accessible.
However, the last line results in this error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/MyMnt/20241007112914_Statistik_789760_0000_327086871111430.xlsx'
Could it be that Pandas has no access to the Blob storage and I have to move the files into DBFS first? If so, how?
Upvotes: 0
Views: 49
Reputation: 1625
Pandas' read_excel function cannot read files stored in Azure Data Lake Storage Gen2 (ADLS Gen2) through the Databricks File System (DBFS) path /mnt/myMnt directly.
Pandas operates on the driver's local filesystem, so you need to reference the file by a path that exists there.
In Databricks, the /dbfs/ directory is a local mapping of the DBFS root, including its mount points. Any local process (like Pandas) can therefore access DBFS files under /dbfs/ as if they were on the local filesystem, using standard file paths.
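The path mapping described above can be sketched as a small helper. This is only an illustration: to_local_path is a hypothetical name, not a Databricks API, and report.xlsx is a placeholder file name. It assumes the /dbfs/ mapping is available on your cluster.

```python
def to_local_path(dbfs_path: str) -> str:
    """Map a DBFS-style path to its /dbfs/ local-filesystem equivalent.

    Hypothetical helper for illustration; it only rewrites the string and
    does not check that the file exists.
    """
    if dbfs_path.startswith("/dbfs/"):
        return dbfs_path  # already a local-filesystem path
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]  # drop the URI scheme
    return "/dbfs/" + dbfs_path.lstrip("/")


# Both spellings of the same DBFS location map to the same local path.
print(to_local_path("/mnt/myMnt/report.xlsx"))
print(to_local_path("dbfs:/mnt/myMnt/report.xlsx"))
```

The returned string is what you would pass to pd.read_excel, while the original /mnt/... form is what dbutils.fs.ls and Spark expect.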
You can use something like below. It reads the file directly through the /dbfs/ path, so the file does not need to be moved or copied first.
#%pip install openpyxl pandas
import pandas as pd
display(dbutils.fs.ls("/mnt/myMnt"))
path = "/dbfs/mnt/myMnt/20241007112914_Statistik_789760_0000_327086871111430.xlsx"
df_xl = pd.read_excel(path, engine="openpyxl")
display(df_xl)
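If the /dbfs/ path is not usable on your cluster (an assumption about your setup, as this can depend on the cluster configuration), copying the file to the driver's local disk first is another option. A minimal sketch, assuming the dbutils object from the notebook is passed in; copy_to_driver_tmp is a hypothetical helper name:

```python
import os


def copy_to_driver_tmp(dbutils, dbfs_path: str, tmp_dir: str = "/tmp") -> str:
    """Copy one file from DBFS to the driver's local disk; return the local path.

    Hypothetical helper: `dbutils` is the object Databricks notebooks provide,
    passed in explicitly so the function has no hidden dependencies.
    """
    local_path = os.path.join(tmp_dir, os.path.basename(dbfs_path))
    # dbutils.fs.cp expects URI-style paths: dbfs:/... for DBFS, file:/... for local disk.
    src = dbfs_path if dbfs_path.startswith("dbfs:") else "dbfs:" + dbfs_path
    dbutils.fs.cp(src, "file:" + local_path)
    return local_path


# Usage inside a Databricks notebook (not runnable elsewhere):
# local_path = copy_to_driver_tmp(dbutils, "/mnt/myMnt/my_excel.xlsx")
# df_xl = pd.read_excel(local_path, engine="openpyxl")
```

The copy lives only on the driver node, so it will be gone once the cluster restarts.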
Or
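You could read the file with Spark instead, using spark.read.format("com.crealytics.spark.excel"), which understands DBFS paths natively. Note that this format comes from the third-party spark-excel library, which has to be installed on the cluster; it is not built into Databricks.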
df_spark = (spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dataAddress", "'Sheet1'!A1:F10")
    .option("treatEmptyValuesAsNulls", "true")
    .load("/mnt/myMnt/my_excel.xlsx"))
Upvotes: 0