ungalVicky

Reputation: 53

How to parse XML with XSD using spark-xml package?

I am trying to parse a simple XML file by supplying an XSD schema, using the approach described here:

https://github.com/databricks/spark-xml#xsd-support

XML is here:

<?xml version="1.0"?>  
<beginnersbook>
 <to>My Readers</to>
 <from>Chaitanya</from>
 <subject>A Message to my readers</subject>
 <message>Welcome to beginnersbook.com</message>
</beginnersbook>

XSD is here:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://www.beginnersbook.com"
xmlns="https://www.beginnersbook.com"
elementFormDefault="qualified">

<xs:element name="beginnersbook">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="subject" type="xs:string"/>
      <xs:element name="message" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>

I read this XSD and build a Spark schema from it like this:

import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths
val schemaParsed = XSDToSchema.read(Paths.get("<local_linux_path>/sample_file.xsd"))
println(schemaParsed)

The schema is parsed successfully. Next, I read the XML file like this:

val df = spark.read.format("com.databricks.spark.xml").schema(schemaParsed).load("<hdfs_path>/sample_file.xml")

After this step, df.printSchema() displays the DataFrame's schema, but df.show() returns no content.

Please point out what I am doing wrong here.

Thanks in advance.

Upvotes: 1

Views: 5462

Answers (1)

John Glenn

Reputation: 1629

For those who come here in search of an answer: tools like this online XSD / XML validator can pick out the errors that occur when your XML sample is checked against your schema.

In this case, the XSD declares targetNamespace="https://www.beginnersbook.com", but the XML does not use that namespace, and this mismatch caused the issue. It can be resolved either by removing the target namespace from the XSD or by modifying the XML to use the target namespace. The XSD modification is straightforward: just remove the targetNamespace attribute. The XML modification could look like this in simple form:

<?xml version="1.0"?>  
<beginnersbook
    xmlns="https://www.beginnersbook.com">
 <to>My Readers</to>
 ...
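
If an online validator is not an option, the same check can be run locally. The following is a minimal sketch, assuming only the JDK's built-in javax.xml.validation API (it does not use spark-xml); the object name XsdCheck and the helper isValid are hypothetical names introduced for illustration. It validates the question's XML against the question's XSD, once without and once with the namespace declaration on the root element.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import scala.util.Try

// Hypothetical helper object, not part of spark-xml.
object XsdCheck {
  // The XSD from the question, unchanged.
  val xsd: String =
    """<?xml version="1.0"?>
      |<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      |  targetNamespace="https://www.beginnersbook.com"
      |  xmlns="https://www.beginnersbook.com"
      |  elementFormDefault="qualified">
      |  <xs:element name="beginnersbook">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element name="to" type="xs:string"/>
      |        <xs:element name="from" type="xs:string"/>
      |        <xs:element name="subject" type="xs:string"/>
      |        <xs:element name="message" type="xs:string"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |</xs:schema>""".stripMargin

  // Body shared by both XML variants.
  private val children: String =
    """ <to>My Readers</to>
      | <from>Chaitanya</from>
      | <subject>A Message to my readers</subject>
      | <message>Welcome to beginnersbook.com</message>""".stripMargin

  // Original XML: the root element is in no namespace.
  val xmlWithoutNamespace: String =
    s"<beginnersbook>\n$children\n</beginnersbook>"

  // Modified XML: the root declares the schema's target namespace as default.
  val xmlWithNamespace: String =
    s"""<beginnersbook xmlns="https://www.beginnersbook.com">\n$children\n</beginnersbook>"""

  // Returns true when xmlText validates against xsdText, false on any error.
  def isValid(xsdText: String, xmlText: String): Boolean = Try {
    val factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema  = factory.newSchema(new StreamSource(new StringReader(xsdText)))
    schema.newValidator().validate(new StreamSource(new StringReader(xmlText)))
  }.isSuccess
}
```

Calling XsdCheck.isValid(XsdCheck.xsd, XsdCheck.xmlWithoutNamespace) should return false, because the validator finds no declaration for an un-namespaced beginnersbook element, while the same call with XsdCheck.xmlWithNamespace should return true. To see the validator's actual message instead of a Boolean, drop the Try wrapper and let the SAXException propagate.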

Upvotes: 0
