userMod2

Reputation: 9000

PySpark3 - Reading XML files

I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).

I have this code:

df = spark.read.load("wasb:///data/test/Sample Data.xml")

However, I keep getting the error java.io.IOException: Could not read footer for file:

An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

I know it's reaching the file (the length shown matches the XML file size), but I'm stuck after that.

Any ideas?

Thanks.

Upvotes: 0

Views: 684

Answers (2)

Faitus Joseph

Reputation: 69

Below are the step-by-step instructions I followed to resolve this issue.

Step 1) Install the XML package in Apache Spark

You can download the JAR from https://libraries.io/maven/com.databricks:spark-xml_2.12 and follow the installation instructions at https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages#manage-packages-from-synapse-studio-or-azure-portal
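
Alternatively, since the asker is in a Jupyter notebook, here is a session-level sketch for HDInsight-style (sparkmagic/Livy) notebooks rather than Synapse Studio. It assumes the %%configure cell magic is available; the spark-xml version below is an assumption, so match it to your cluster's Scala version:

%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.12:0.18.0" } }

Note that the -f flag restarts the Spark session, so run this cell before any other Spark code in the notebook.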

Step 2) Grant "Storage Blob Data Contributor" access to the user on the container under IAM

Step 3) Run the notebook, creating the DataFrame with the script below

%%pyspark
df = spark.read \
    .format("xml") \
    .option("rowTag", "xml tag name here") \
    .load('abfss://<container name>@<storage account name>.dfs.core.windows.net/<path of the xml file>/file.xml')

display(df)

The DataFrame should be displayed.
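
To illustrate the rowTag option: spark-xml produces one row per element matching rowTag and one column per child element. A minimal sketch, assuming a hypothetical books.xml whose root <catalog> contains repeated <book> elements, each with <title> and <price> children:

%%pyspark
# One row per <book>; columns (title, price) are inferred from child elements.
df = spark.read \
    .format("xml") \
    .option("rowTag", "book") \
    .load('abfss://<container name>@<storage account name>.dfs.core.windows.net/data/books.xml')

df.printSchema()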

Upvotes: 0

Peter Pan

Reputation: 24148

Please refer to the two blogs below; I think they can answer your question completely.

  1. Azure Blob Storage with Pyspark
  2. Reading JSON, CSV and XML files efficiently in Apache Spark

The code is as below.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR SAS token for a container:
# session.conf.set(
#    "fs.azure.sas.<container-name>.blob.core.windows.net",
#    "<sas-token>"
# )

# your Sample Data.xml file is in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book").load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/") 

If you are using Azure Databricks, I think the code will work as expected; otherwise, you may need to install the com.databricks.spark.xml library in your Apache Spark cluster.
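
If you manage the SparkSession yourself, one way to install it is to let Spark fetch the package from Maven at startup. A minimal sketch, assuming the session is created before any other Spark code runs; the package version is an assumption, so match it to your cluster's Scala version:

from pyspark.sql import SparkSession

# spark.jars.packages only takes effect when the JVM starts, so it will not
# help in a notebook where a Spark session already exists.
session = SparkSession.builder \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .getOrCreate()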

Hope it helps.

Upvotes: 1
