Reputation: 31
I am trying to parse XML files against an XSD schema using the spark-xml library in PySpark. Below is the code:
xml_df = spark.read.format("com.databricks.spark.xml") \
    .option("rootTag", "Document") \
    .option("rowTag", "row01") \
    .option("rowValidationXSDPath", "auth.011.001.02_ABC_1.1.0.xsd") \
    .load("/mnt/bronze/ABC-3.xml")
I am getting the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.1.4 executor driver): java.util.concurrent.ExecutionException: org.xml.sax.SAXParseException; systemId: file:/local_disk0/auth.011.001.02_ABC_1.1.0.xsd; lineNumber: 5846; columnNumber: 99; Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000.
I have looked for ways to set jdk.xml.maxOccurLimit=0 on the Databricks cluster but didn't find any.
Any help on solving this error would be highly appreciated.
Upvotes: 0
Views: 263
Reputation: 2764
As per the documentation, you can set jdk.xml.maxOccurLimit=0. I reproduced the same error in my environment; to resolve it, follow the sample code below:
# Pass the JVM option that lifts the XSD maxOccurs limit (0 = no limit)
spark.conf.set("spark.jvm.args", "-Djdk.xml.maxOccurLimit=0")

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "book") \
    .load("dbfs:/FileStore/gg.xml")
display(df)
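If setting the property from a running notebook session does not take effect (JVM system properties generally need to be in place before the JVM starts), an alternative worth trying is to add the option to the cluster's Spark config (cluster settings, Advanced Options, Spark config) and restart the cluster. The keys below are the standard Spark extraJavaOptions settings, offered as a sketch rather than something confirmed in this thread:

    spark.driver.extraJavaOptions -Djdk.xml.maxOccurLimit=0
    spark.executor.extraJavaOptions -Djdk.xml.maxOccurLimit=0

Setting it on both driver and executors matters because the XSD validation in spark-xml runs inside the executor tasks, as the stage-failure stack trace in the question shows.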
Upvotes: 0