Parsing & converting nested xml in python

Question

I have the XML data below.

I need to convert this to a table format. The issue is that there are many nested branches inside each tag, e.g. many & tags. Irrespective of the nesting depth, I need to list the data one row below the other.

My desired output is as follows:

+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
|   date   | ticket | value | notenders | tendertype | tenderamt | receipeno | price | qty |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
| 20190101 |  12345 |    15 |         1 |          0 |        15 |      1096 |     7 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       786 |     8 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       599 |     0 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       605 |     0 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       608 |     0 |   4 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       143 |     0 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       381 |     7 |   1 |
| 20190101 |  12345 |    15 |         1 |          0 |        15 |       607 |     0 |   1 |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+

I'm new to Python and XML parsing, so kindly point me toward a solution. ...

Valdi_Bo · Accepted Answer

Start with the necessary imports:

import pandas as pd
import xml.etree.ElementTree as et
import re

Then, to remove leading zeroes from the text of some of the tags to be read, define the following function:

def stripLZ(src):
    return re.sub(r'^0+(?=\d)', '', src)
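A quick check of what stripLZ does may help here. This is only a minimal sketch, and the sample strings are made up for illustration rather than taken from the question's data:

```python
import re

def stripLZ(src):
    # Drop leading zeroes, but only while another digit follows,
    # so a value made up solely of zeroes keeps its last '0'.
    return re.sub(r'^0+(?=\d)', '', src)

samples = ['0001', '0015', '000', '10', '0']
stripped = [stripLZ(s) for s in samples]
print(stripped)  # ['1', '15', '0', '10', '0']
```

The lookahead (?=\d) is what keeps '000' from collapsing to an empty string.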

To read the source file and its root element, execute:

tree = et.parse('transaction.xml')
root = tree.getroot()

To read the tags at the root level (those that are not inside items), execute:

dt = root.find('date').text
tck = root.find('ticket').text
val = root.find('value').text
notend = stripLZ(root.find('notenders').text)

The two remaining tags are one level down, so start by reading their parent tag:

tdet = root.find('tenderdetail')

and read these tags from it:

tendtyp = stripLZ(tdet.find('tendertype').text)
tendamt = tdet.find('tenderamt').text

Note that I used the stripLZ function here (it will be used again below).
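One caveat: find returns None when a tag is absent, so root.find('x').text raises AttributeError in that case. If your files may lack some tags (an assumption, since the question does not say), findtext with a default is a possible alternative. A minimal sketch, using a made-up fragment whose root name transaction is hypothetical:

```python
import xml.etree.ElementTree as et

# Hypothetical fragment: the 'value' tag is deliberately missing.
root = et.fromstring(
    '<transaction><date>20190101</date><ticket>12345</ticket></transaction>'
)

dt = root.findtext('date', default='')    # text of the tag if present
val = root.findtext('value', default='')  # falls back to '' instead of failing
missing = root.find('value')              # find itself returns None here

print(repr(dt), repr(val), missing)
```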

Now it is time to create the result DataFrame:

df_cols = ['date', 'ticket', 'value', 'notenders',
    'tendertype', 'tenderamt', 'receipeno', 'price', 'qty']
df = pd.DataFrame(columns = df_cols)

The loading loop can then be written using the iter method:

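# DataFrame.append was removed in pandas 2.0, so collect the rows
# in a list first and build the DataFrame once at the end.
rows = []
for it in root.iter('item'):
    rcp = it.find('receipeno').text
    prc = it.find('price').text
    qty = stripLZ(it.find('qty').text)
    rows.append([dt, tck, val, notend,
        tendtyp, tendamt, rcp, prc, qty])
df = pd.DataFrame(rows, columns = df_cols)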

This loop:

Iterates over all item tags, regardless of their depth.
Reads 3 tags from the current item.
Adds one row to the result for each item.
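To see why iter is the right choice for nested data, here is a small self-contained sketch. The XML is hypothetical: the tag names follow the answer, but the root name and the nesting (items, bundle) are invented, since the original file is not shown.

```python
import re
import xml.etree.ElementTree as et

def stripLZ(src):
    return re.sub(r'^0+(?=\d)', '', src)

# Hypothetical input: two items at different depths.
xml_text = """
<transaction>
  <date>20190101</date>
  <ticket>12345</ticket>
  <items>
    <item><receipeno>1096</receipeno><price>7</price><qty>0001</qty></item>
    <bundle>
      <item><receipeno>786</receipeno><price>8</price><qty>0004</qty></item>
    </bundle>
  </items>
</transaction>
"""
root = et.fromstring(xml_text)

# findall('item') only inspects direct children of root: none here.
direct = root.findall('item')

# iter('item') walks the whole subtree, so both nested items are found.
header = {'date': root.findtext('date'), 'ticket': root.findtext('ticket')}
rows = []
for it in root.iter('item'):
    row = dict(header)  # repeat the root-level values on every row
    row['receipeno'] = it.findtext('receipeno')
    row['price'] = it.findtext('price')
    row['qty'] = stripLZ(it.findtext('qty'))
    rows.append(row)

print(len(direct), len(rows))  # 0 2
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) to get the table.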
