Reputation: 11
I am using Spark's `xpath` function to extract attribute values from an XML string. The function returns an array of values from the matching nodes. However, when one of the repeated elements has an empty (null) value, `xpath` silently drops it from the returned array. What I am looking for is a way to return a default string when a value is missing, so that the order and length of the returned arrays are preserved.
df = spark.createDataFrame([['<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns1:entities><ns1:entity ns1:type="PHYSICIANS"><ns1:entity ns1:instance="207" ns1:type="PHYSICIAN" ns1:id="P1"><ns1:attribute ns1:name="ID">2071</ns1:attribute><ns1:attribute ns1:name="NAME"></ns1:attribute></ns1:entity><ns1:entity ns1:instance="208" ns1:type="PHYSICIAN" ns1:id="P2"><ns1:attribute ns1:name="ID">2081</ns1:attribute><ns1:attribute ns1:name="NAME">Dr. James Hanover</ns1:attribute></ns1:entity></ns1:entity></ns1:entities>']], ['visitors'])
df = df.selectExpr(
    'xpath(visitors, "./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'ID\']/text()") ID',
    'xpath(visitors, "./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'NAME\']/text()") NAME'
)
display(df)
This gives me ID = ["2071", "2081"] but NAME = ["Dr. James Hanover"]: the empty NAME is dropped, so the two arrays no longer line up.
What I am expecting is for NAME to keep a placeholder for the missing value, e.g. ["default", "Dr. James Hanover"], so both arrays have the same length and order.
Can someone please help?
Upvotes: 0
Views: 583
Reputation: 14925
The xpath approach fails because `text()` on an empty element produces no node at all; since the first physician's NAME element is empty, nothing is added to the result array for it. I fixed the issue by replacing the empty element's content with a space before calling xpath.
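That string-level patch can be sketched with plain Python's `re` (in Spark SQL the equivalent would be `regexp_replace` on the column before calling `xpath`); the pattern only matches empty `ns1:attribute` elements, since non-empty ones have text between `>` and `</`:

```python
import re

xml = ('<ns1:attribute ns1:name="ID">2071</ns1:attribute>'
       '<ns1:attribute ns1:name="NAME"></ns1:attribute>')

# Put a single space inside empty attribute elements so that
# text() yields a node for them instead of skipping the value.
patched = re.sub(r"></ns1:attribute>", "> </ns1:attribute>", xml)
```

After this substitution, xpath returns a " " entry for the empty name, keeping the arrays aligned.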
However, this approach still gives a single row with two arrays, which is not what we really want.
A better solution is to use the spark-xml library from Databricks with a much simpler XML file.
The code below creates an example physician file using your data.
#
# 1 - create data file
#
# raw data
# note: the XML declaration must be the very first characters of the file,
# so the string starts immediately after the opening quotes
xml_str = """<?xml version="1.0"?>
<physicians>
  <physician>
    <id>2071</id>
    <name></name>
  </physician>
  <physician>
    <id>2081</id>
    <name>Dr. James Hanover</name>
  </physician>
</physicians>
"""
# raw file
dbutils.fs.put("/temp/physicians.xml", xml_str, True)
The next step is to install the spark-xml library on the cluster. Please see the documentation:
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/xml
Next, read the file into a data frame that has 2 rows and 2 columns.
#
# 2 - read data file
#
df2 = spark.read \
    .format("xml") \
    .option("rootTag", "physicians") \
    .option("rowTag", "physician") \
    .load("/temp/physicians.xml")
display(df2)
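For readers without a cluster handy, the same row-wise extraction with a default for missing names can be sketched using only the standard library (the "N/A" default string here is just an illustrative choice):

```python
import xml.etree.ElementTree as ET

xml_str = """<physicians>
  <physician><id>2071</id><name></name></physician>
  <physician><id>2081</id><name>Dr. James Hanover</name></physician>
</physicians>"""

# findtext() returns "" for an empty element, so `or` swaps in the default
rows = [
    (p.findtext("id"), p.findtext("name") or "N/A")
    for p in ET.fromstring(xml_str).iter("physician")
]
```

The same idea applies after the spark-xml read: the empty `<name>` most likely comes back as null in `df2`, which you can replace with a default via `df2.na.fill(...)`.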
Upvotes: 1