ahkam

Reputation: 767

How to read a nested XML correctly in PySpark using spark-xml?

I have an XML file like the one below.

<document>
  <root dataid="2000" path="U.S. Broadline OpCos:Eastern Maryland:Documents-Standard:Finance:Accounts Receivables">
    <documentnode name="123.pdf">
      <version filepath="data.pdf" mimetype="application/pdf"></version>
      <categories>
        <category name="Customer Delivery">
          <attribute name="Invoice Number">1234</attribute>
          <attribute name="Customer Number">543</attribute>
          <attribute name="Document Type">Original Customer Invoice</attribute>
          <attribute name="Capture Date" dateformat="yyyyMMdd">20230914</attribute>
          <attribute name="Location Name">Eastern Maryland</attribute>
          <attribute name="Location Number">21</attribute>
          <attribute name="Ship to Customer Name">Jill</attribute>
          <attribute name="Bill to Customer Name">Jill</attribute>
          <attribute name="Delivery Date" dateformat="yyyyMMdd">20230909</attribute>
          <attribute name="Territory Number">ART</attribute>
          <attribute name="Manifest Number">435</attribute>
          <attribute name="Route Number">76543</attribute>
          <attribute name="Invoice Type">Priced</attribute>
        </category>
      </categories>
    </documentnode>
  </root>
</document>

And below is my code to read this XML. I have read the XML as a Spark dataframe for a reason and am converting it back to a pandas dataframe.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Nested XML to DataFrame").getOrCreate()
df = spark.read.format("com.databricks.spark.xml")\
    .option("rootTag", "document") \
    .option("rowTag", "documentnode")\
    .load("test.xml")
ss_pandas_df = df.toPandas()
print(ss_pandas_df.head(1))

And the output looks like this:

 _name                                         categories                            version
0  123.pdf  ((Customer Delivery, [Row(_VALUE='1234', _date...  (None, data.pdf, application/pdf)

When I try to print the categories column data with print(ss_pandas_df.iloc[0]['categories']), it looks like this:

Row(category=Row(_name='Customer Delivery', attribute=[Row(_VALUE='1234', _dateformat=None, _name='Invoice Number'), Row(_VALUE='543', _dateformat=None, _name='Customer Number'), Row(_VALUE='Original Customer Invoice', _dateformat=None, _name='Document Type'), Row(_VALUE='20230914', _dateformat='yyyyMMdd', _name='Capture Date'), Row(_VALUE='Eastern Maryland', _dateformat=None, _name='Location Name'), Row(_VALUE='21', _dateformat=None, _name='Location Number'), Row(_VALUE='Jill', _dateformat=None, _name='Ship to Customer Name'), Row(_VALUE='Jill', _dateformat=None, _name='Bill to Customer Name'), 
Row(_VALUE='20230909', _dateformat='yyyyMMdd', _name='Delivery Date'), Row(_VALUE='ART', _dateformat=None, _name='Territory Number'), Row(_VALUE='435', _dateformat=None, _name='Manifest Number'), Row(_VALUE='76543', _dateformat=None, _name='Route Number'), Row(_VALUE='Priced', _dateformat=None, _name='Invoice Type')]))

But this is not what I am expecting. I need these attributes as separate columns, with column names like Invoice Number, Customer Number, etc. What am I missing here?

Note: I have added the spark-xml package as well.

EDIT: In the end I'm expecting a dataframe with the attribute names as the column names and their values as the column values, i.e.:

Invoice Number|Customer Number|...|Invoice Type

1234|543|...|Priced

Upvotes: 1

Views: 818

Answers (1)

DanielP

Reputation: 51

I hope I can help you or at least point you in the right direction.

With nested structures you have to unpack the layers step by step. When you create your dataframe, you can inspect the nested schema with printSchema().

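Based on the Row output shown in the question, it should print roughly the following (a sketch; nullability annotations are omitted and the exact types may differ):

df.printSchema()

root
 |-- _name: string
 |-- categories: struct
 |    |-- category: struct
 |    |    |-- _name: string
 |    |    |-- attribute: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- _VALUE: string
 |    |    |    |    |-- _dateformat: string
 |    |    |    |    |-- _name: string
 |-- version: struct
 |    |-- _VALUE: string
 |    |-- _filepath: string
 |    |-- _mimetype: string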

Here it is important to distinguish between arrays and structs. Structs can be resolved with a "select" expression and arrays with an "explode" function.

df_categories = df.select("categories.*")

With this "Select" you explicitly select the column "categories" and all its values. But note that you drop all other columns. If you want to keep them, you have to specify this as well.

The result would look like this:
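A rough sketch of the schema at this point (only the single category struct column remains; exact types may differ):

df_categories.printSchema()

root
 |-- category: struct
 |    |-- _name: string
 |    |-- attribute: array
 |    |    |-- element: struct
 |    |    |    |-- _VALUE: string
 |    |    |    |-- _dateformat: string
 |    |    |    |-- _name: string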

This flattens things somewhat, but it is still not enough. If we also unpack the underlying category struct, the data becomes flatter still:

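In code, this step looks roughly like this (I'm calling the intermediate dataframe df_category here; the name is not fixed):

# Flatten the category struct into its fields: _name and the attribute array
df_category = df_categories.select("category.*")
df_category.printSchema()

root
 |-- _name: string
 |-- attribute: array
 |    |-- element: struct
 |    |    |-- _VALUE: string
 |    |    |-- _dateformat: string
 |    |    |-- _name: string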

Now we have an array at the top level, which we have to explode. For this, the explode function must be imported first:

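A sketch of the explode step, continuing from df_category above:

from pyspark.sql.functions import explode

# One row per attribute, then flatten the struct into _VALUE, _dateformat and _name columns
df_attributes = df_category.select(explode("attribute").alias("attribute")).select("attribute.*")
df_attributes.show(truncate=False)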

Now we have reached the lowest level. All you have to do now is to pivot the rows into columns.

I hope I was able to help you.

EDIT:

You said that you want the values from "_name" as columns, but I don't know how useful that is in this context. You could use the following code to pivot:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Add a column with a unique identifier for each row
df_attributes = df_attributes.withColumn("row_id", F.monotonically_increasing_id())

# Use the "pivot" function to turn the entries in "_name" into columns
pivot_df = df_attributes.groupBy("row_id").pivot("_name").agg(F.first("_VALUE"))

# Optional: fill missing values with 0
pivot_df = pivot_df.fillna(0)

pivot_df.display()
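With the sample document from the question, pivot_df should come out as a single row whose columns are the attribute names (Invoice Number, Customer Number, ..., Invoice Type); you can drop row_id afterwards if you don't need it. Two small caveats: fillna(0) only fills numeric columns, so for the string columns here you would need fillna("") instead, and display() is specific to Databricks notebooks, so use pivot_df.show() in plain PySpark.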

In my opinion, it would make more sense to access the values directly from the "_name" column.
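That could look something like this (a sketch, reusing df_attributes from above):

# Look up a single attribute value by its name
invoice_number = (
    df_attributes
    .filter(F.col("_name") == "Invoice Number")
    .select("_VALUE")
    .first()[0]
)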


Upvotes: 1
