Reputation: 531
I want to parse XML using Spark, so I am using the Databricks spark-xml library. A sample of the XML is as follows:
<Transactions>
<Transaction>
<transid>1111</transid>
</Transaction>
<Transaction>
<transid>2222</transid>
</Transaction>
</Transactions>
<Payments>
<Payment>
<Id>123</Id>
</Payment>
<Payment>
<Id>456</Id>
</Payment>
</Payments>
Code to parse it:
val transNestedDF = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Transactions")
  .load("trans_nested.xml")
transNestedDF.registerTempTable("TransNestedTbl")
sqlContext.sql("select Transaction[0].transid from TransNestedTbl").collect()
Here I don't have any root tag, and I can't define multiple row tags. If I have to process both transactions and payments in a single read, using a single DataFrame as above, how can I achieve that?
Upvotes: 1
Views: 1878
Reputation: 66886
You can't do it in one read if there is no tag around both of these. If there is any common parent tag, you can use that as the rowTag and ignore the rest of what is parsed.
You can of course read them separately into two DataFrames. That works fine if you treat them separately. But you lose the association between transactions and payments, unless you can join on some ID.
But then I'd wonder why the XML structure doesn't have any common parent if these are associated.
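To make the "common parent" idea concrete outside Spark: if you can wrap the two top-level sections in a synthetic root element (here a hypothetical `<Root>` tag, not present in the original file), the document becomes well-formed XML and both sections are reachable from one parse. This is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; in spark-xml the analogous move would be using that wrapper element as the rowTag.

```python
import xml.etree.ElementTree as ET

# The two top-level sections from the question, which have no common root
snippet = """<Transactions>
<Transaction><transid>1111</transid></Transaction>
<Transaction><transid>2222</transid></Transaction>
</Transactions>
<Payments>
<Payment><Id>123</Id></Payment>
<Payment><Id>456</Id></Payment>
</Payments>"""

# Wrap in a synthetic root so the fragment parses as well-formed XML
root = ET.fromstring("<Root>" + snippet + "</Root>")

# Both sections are now children of the same parent
transids = [e.text for e in root.findall("Transactions/Transaction/transid")]
payment_ids = [e.text for e in root.findall("Payments/Payment/Id")]
```

Without some shared key, though, the two lists stay unrelated, which is the point about losing the transaction/payment association.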
Upvotes: 1
Reputation: 24928
Let's try this with lxml, a Python library that supports XPath:
If you don't have it installed, you need to:
pip install lxml
then:
import lxml.html

pay = """ [your code above] """

doc = lxml.html.fromstring(pay)

# lxml's html parser lowercases tag names, hence the .lower() calls
tid = doc.xpath('Transactions//transid'.lower())  # or ('//Transactions//transid'.lower()) depending on the structure of the original doc
pid = doc.xpath('Payments//id'.lower())  # same comment

final = ''
for i in tid:
    for p in pid:
        final = final + i.text + '|' + p.text + '\n'
print(final)
Output:
1111|123
1111|456
2222|123
2222|456
Upvotes: 2