How to read a nested xml correctly in pyspark using spark-xml?

Question

I have an xml file like below.


  
    
      
      
        
          1234
          543
          Original Customer Invoice
          20230914
          Eastern Maryland
          21
          Jill
          Jill
          20230909
          ART
          435
          76543
          Priced

And below is my code to read this xml. Here I have read the xml as spark dataframe for a reason and converting it back to pandas dataframe.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Nested XML to DataFrame").getOrCreate()
df = spark.read.format("com.databricks.spark.xml")\
    .option("rootTag", "document") \
    .option("rowTag", "documentnode")\
    .load("test.xml")
ss_pandas_df = df.toPandas()
print(ss_pandas_df.head(1))

And the output looks like below

 _name                                         categories                            version
0  123.pdf  ((Customer Delivery, [Row(_VALUE='1234', _date...  (None, data.pdf, application/pdf)

When I try to print the categories column data print(ss_pandas_df.iloc[0]['categories']), it looks like this

Row(category=Row(_name='Customer Delivery', attribute=[Row(_VALUE='1234', _dateformat=None, _name='Invoice Number'), Row(_VALUE='543', _dateformat=None, _name='Customer Number'), Row(_VALUE='Original Customer Invoice', _dateformat=None, _name='Document Type'), Row(_VALUE='20230914', _dateformat='yyyyMMdd', _name='Capture Date'), Row(_VALUE='Eastern Maryland', _dateformat=None, _name='Location Name'), Row(_VALUE='21', _dateformat=None, _name='Location Number'), Row(_VALUE='Jill', _dateformat=None, _name='Ship to Customer Name'), Row(_VALUE='Jill', _dateformat=None, _name='Bill to Customer Name'), 
Row(_VALUE='20230909', _dateformat='yyyyMMdd', _name='Delivery Date'), Row(_VALUE='ART', _dateformat=None, _name='Territory Number'), Row(_VALUE='435', _dateformat=None, _name='Manifest Number'), Row(_VALUE='76543', _dateformat=None, _name='Route Number'), Row(_VALUE='Priced', _dateformat=None, _name='Invoice Type')]))

But this is not I am expecting. I need the below attributes into as separate columns with the column name like Invoice Number/Customer Number etc. What am I missing here

note: I have added spark-xml package as well

EDIT: At the end I'm expecting a dataframe with, attribute name as the column name and their values as the column values. ie:

Invoice Number|Customer Number|...|Invoice Type

1234|543|...|Priced

How to read a nested xml correctly in pyspark using spark-xml?

Answers (1)

Related Questions