Reputation: 9000
I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).
I have this code:
df = spark.read.load("wasb:///data/test/Sample Data.xml")
However, I keep getting the error java.io.IOException: Could not read footer for file:
An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I know it's reaching the file, since the length in the error matches the XML file's size, but I'm stuck after that.
Any ideas?
Thanks.
Upvotes: 0
Views: 684
Reputation: 69
Below are the step-by-step instructions I followed to resolve this issue.
Step 1) Install the XML package in Apache Spark
You can download the JAR from the link below: https://libraries.io/maven/com.databricks:spark-xml_2.12
Instructions: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages#manage-packages-from-synapse-studio-or-azure-portal
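If you are on an HDInsight or another Livy-backed Jupyter notebook rather than Synapse Studio, a minimal sketch for pulling the package at session start is a %%configure cell like the one below (the 0.15.0 version number is an assumption; pick one matching your cluster's Spark and Scala build):
%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.databricks:spark-xml_2.12:0.15.0"
    }
}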
Step 2) Grant the "Storage Blob Data Contributor" role to the user on the container under IAM
Step 3) Run the notebook by creating dataframe using the below script
%%pyspark
df = spark.read \
    .format("xml") \
    .option("rowTag", "xml tag name here") \
    .load('abfss://<container name>@<storage account name>.dfs.core.windows.net/<path of the xml file>/file.xml')
display(df)
The data frame should be displayed.
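If display() is not available in your environment, you can still verify the load with plain PySpark calls; a minimal check might be:
# quick sanity checks on the loaded data frame
df.printSchema()   # schema inferred from the XML elements under the rowTag
df.show(5)         # first few rows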
Upvotes: 0
Reputation: 24148
Please refer to the two blog posts below; I think they can answer your question completely.
The code is as below.
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR a SAS token for a container:
# session.conf.set(
#     "fs.azure.sas.<container-name>.blob.core.windows.net",
#     "<sas-token>"
# )

# your Sample Data.xml file is in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
If you are using Azure Databricks, I think the code will work as expected; otherwise, you may need to install the com.databricks.spark.xml library in your Apache Spark cluster.
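If you would rather pull the library from Maven than install it on the cluster by hand, a minimal sketch is to request it when building the session (the 0.15.0 version number is an assumption; match it to your Spark/Scala build, and note that this setting only takes effect if it runs before any session exists):
from pyspark.sql import SparkSession

# a sketch: ask Spark to fetch spark-xml from Maven when the session starts
# (the version is an assumption; spark.jars.packages is ignored
#  if a session/JVM has already been created)
session = SparkSession.builder \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0") \
    .getOrCreate()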
Hope it helps.
Upvotes: 1