newbee123

Reputation: 31

pyspark: org.xml.sax.SAXParseException Current config of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000

I am trying to parse XML files against an XSD using the spark-xml library in PySpark. Below is the code:

xml_df = spark.read.format("com.databricks.spark.xml") \
    .option("rootTag", "Document") \
    .option("rowTag", "row01") \
    .option("rowValidationXSDPath","auth.011.001.02_ABC_1.1.0.xsd") \
    .load("/mnt/bronze/ABC-3.xml")

I am getting the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.1.4 executor driver): java.util.concurrent.ExecutionException: org.xml.sax.SAXParseException; systemId: file:/local_disk0/auth.011.001.02_ABC_1.1.0.xsd; lineNumber: 5846; columnNumber: 99; Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000.

I have looked for a way to set jdk.xml.maxOccurLimit=0 on a Databricks cluster but couldn't find one.
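For reference, the offending declarations can be located with Python's standard library alone (a minimal sketch; find_large_maxoccurs is a hypothetical helper, and the path would be the XSD file from the code above):

```python
import xml.etree.ElementTree as ET

def find_large_maxoccurs(xsd_path, limit=5000):
    """Scan an XSD and return (tag, name, maxOccurs) for every element
    whose numeric maxOccurs attribute exceeds the given limit."""
    hits = []
    for _, elem in ET.iterparse(xsd_path, events=("start",)):
        mo = elem.get("maxOccurs")
        # "unbounded" is allowed by the parser; only large numeric values trip the limit
        if mo and mo != "unbounded" and int(mo) > limit:
            hits.append((elem.tag, elem.get("name"), mo))
    return hits
```

Running it against the schema should point at the same declaration the error reports (line 5846 in this case).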

Any help on solving this error will be highly appreciated.

Upvotes: 0

Views: 263

Answers (1)

Vamsi Bitra

Reputation: 2764

As per the documentation, you can set jdk.xml.maxOccurLimit=0 to remove the limit. I reproduced the same error in my environment.

To resolve it, note that jdk.xml.maxOccurLimit is a JVM system property, so it must be in place before the JVM starts; calling spark.conf.set(...) from a running notebook has no effect. On Databricks, add the following lines to the cluster's Spark config (cluster > Advanced options > Spark) and restart the cluster:

spark.driver.extraJavaOptions -Djdk.xml.maxOccurLimit=0
spark.executor.extraJavaOptions -Djdk.xml.maxOccurLimit=0

After the restart, the file loads successfully:

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("dbfs:/FileStore/gg.xml")  
display(df)


Upvotes: 0
