Reputation: 531
I want to parse XML using Spark, so I am using the Databricks spark-xml library. A sample of the XML is as follows:
<Transactions>
<Transaction>
<transid>1111</transid>
</Transaction>
<Transaction>
<transid>2222</transid>
</Transaction>
</Transactions>
<Payments>
<Payment>
<Id>123</Id>
</Payment>
<Payment>
<Id>456</Id>
</Payment>
</Payments>
Code to parse it:
val transNestedDF = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Transactions")
  .load("trans_nested.xml")
transNestedDF.registerTempTable("TransNestedTbl")
sqlContext.sql("select Transaction[0].transid from TransNestedTbl").collect()
Here I don't have any root tag, and I can't define multiple row tags. If I have to process both transactions and payments in a single read, using a single DataFrame as above, how can I achieve that?
Upvotes: 1
Views: 1878
Reputation: 66886
You can't do it in one read if there is no tag around both of these. If there is any common parent tag, you can use that as the rowTag and ignore the rest of what is parsed.
You can of course read them separately into two DataFrames. That works fine if you treat them separately. But you lose the association between transactions and payments, unless you can join on some ID.
But then I'd wonder why the XML structure doesn't have any common parent if these are associated.
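To make the "common parent" idea concrete outside Spark: if you can wrap the two top-level sections in a synthetic root element (here a hypothetical `<Root>` tag, not present in the original file), the document becomes well-formed XML and both sections are reachable from one parse. This is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; in spark-xml the analogous move would be using that wrapper element as the rowTag.

```python
import xml.etree.ElementTree as ET

# The two top-level sections from the question, which have no common root
snippet = """<Transactions>
<Transaction><transid>1111</transid></Transaction>
<Transaction><transid>2222</transid></Transaction>
</Transactions>
<Payments>
<Payment><Id>123</Id></Payment>
<Payment><Id>456</Id></Payment>
</Payments>"""

# Wrap in a synthetic root so the fragment parses as well-formed XML
root = ET.fromstring("<Root>" + snippet + "</Root>")

# Both sections are now children of the same parent
transids = [e.text for e in root.findall("Transactions/Transaction/transid")]
payment_ids = [e.text for e in root.findall("Payments/Payment/Id")]
```

Without some shared key, though, the two lists stay unrelated, which is the point about losing the transaction/payment association.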
Upvotes: 1
Reputation: 24928
Let's try this with lxml, a Python library that supports XPath:
If you don't have it installed, you need to:
pip install lxml
then:
import lxml.html

pay = """ [your code above] """

doc = lxml.html.fromstring(pay)

# lxml's html parser lowercases tag names, hence the .lower() calls
tid = doc.xpath('Transactions//transid'.lower())  # or ('//Transactions//transid'.lower()) depending on the structure of the original doc
pid = doc.xpath('Payments//id'.lower())  # same comment

final = ''
for i in tid:
    for p in pid:
        final = final + i.text + '|' + p.text + '\n'
print(final)
Output:
1111|123
1111|456
2222|123
2222|456
Upvotes: 2