userMod2

Reputation: 9000

PySpark3 - Reading XML files

I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).

I have this code:

df = spark.read.load("wasb:///data/test/Sample Data.xml")

However, I keep getting the error java.io.IOException: Could not read footer for file:

An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

I know it's reaching the file (the length shown matches the XML file size), but I'm stuck after that.

Any ideas?

Thanks.

Upvotes: 0

Views: 684

Answers (2)

Faitus Joseph

Reputation: 69

Below are the step-by-step instructions I followed to resolve this issue.

Step 1) Install the XML package in Apache Spark

You can download the JAR from https://libraries.io/maven/com.databricks:spark-xml_2.12 and follow the installation instructions at https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages#manage-packages-from-synapse-studio-or-azure-portal
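
Alternatively, since the asker is in a Jupyter notebook, here is a session-level sketch for HDInsight-style (sparkmagic/Livy) notebooks rather than Synapse Studio. It assumes the %%configure cell magic is available; the spark-xml version below is an assumption, so match it to your cluster's Scala version:

%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.12:0.18.0" } }

Note that the -f flag restarts the Spark session, so run this cell before any other Spark code in the notebook.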

Step 2) Grant "Storage Blob Data Contributor" access to the user on the container under IAM

Step 3) Run the notebook, creating the DataFrame with the script below

%%pyspark
df = spark.read \
    .format("xml") \
    .option("rowTag", "xml tag name here") \
    .load('abfss://<container name>@<storage account name>.dfs.core.windows.net/<path of the xml file>/file.xml')

display(df)

The DataFrame should be displayed.
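
To illustrate the rowTag option: spark-xml produces one row per element matching rowTag and one column per child element. A minimal sketch, assuming a hypothetical books.xml whose root <catalog> contains repeated <book> elements, each with <title> and <price> children:

%%pyspark
# One row per <book>; columns (title, price) are inferred from child elements.
df = spark.read \
    .format("xml") \
    .option("rowTag", "book") \
    .load('abfss://<container name>@<storage account name>.dfs.core.windows.net/data/books.xml')

df.printSchema()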

Upvotes: 0

Peter Pan

Reputation: 24148

Please refer to the two blogs below; I think they can answer your question completely.

  1. Azure Blob Storage with Pyspark
  2. Reading JSON, CSV and XML files efficiently in Apache Spark

The code is as below.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR SAS token for a container:
# session.conf.set(
#    "fs.azure.sas.<container-name>.blob.core.windows.net",
#    "<sas-token>"
# )

# your Sample Data.xml file is in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book").load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/") 

If you are using Azure Databricks, I think the code will work as expected; otherwise, you may need to install the com.databricks.spark.xml library in your Apache Spark cluster.
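
If you manage the SparkSession yourself, one way to install it is to let Spark fetch the package from Maven at startup. A minimal sketch, assuming the session is created before any other Spark code runs; the package version is an assumption, so match it to your cluster's Scala version:

from pyspark.sql import SparkSession

# spark.jars.packages only takes effect when the JVM starts, so it will not
# help in a notebook where a Spark session already exists.
session = SparkSession.builder \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .getOrCreate()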

Hope it helps.

Upvotes: 1
