Reputation: 515
I have below xml data.
<transaction>
<date>20190415</date>
<ticket>303434037</ticket>
<value>15</value>
<notenders>01</notenders>
<tenderdetail>
<tendertype>00</tendertype>
<tenderamt>15</tenderamt>
</tenderdetail>
<item>
<receipeno>00001096</receipeno>
<price>7</price>
<qty>0001</qty>
<items>
<item>
<receipeno>00000786</receipeno>
<price>8</price>
<qty>0001</qty>
<items>
<item>
<receipeno>00000599</receipeno>
<price>0</price>
<qty>1</qty>
</item>
<item>
<receipeno>00000605</receipeno>
<price>0</price>
<qty>1</qty>
</item>
<item>
<receipeno>00000608</receipeno>
<price>0</price>
<qty>0004</qty>
</item>
</items>
</item>
<item>
<receipeno>10000043</receipeno>
<price>0</price>
<qty>0001</qty>
</item>
<item>
<receipeno>00000381</receipeno>
<price>7</price>
<qty>0001</qty>
<items>
<item>
<receipeno>00000607</receipeno>
<price>0</price>
<qty>1</qty>
</item>
</items>
</item>
</items>
</item>
</transaction>
I need to convert this to a table format. the issues is there are many nested branches inside each tag. eg many <item>
& <items>
tags. irrespective of the nested-ness. I need to list down the data one below he other.
my desired output is as follows
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
| date | ticket | value | notenders | tendertype | tenderamt | receipeno | price | qty |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 1096 | 7 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 786 | 8 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 599 | 0 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 605 | 0 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 608 | 0 | 4 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 143 | 0 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 381 | 7 | 1 |
| 20190101 | 12345 | 15 | 1 | 0 | 15 | 607 | 0 | 1 |
+----------+--------+-------+-----------+------------+-----------+-----------+-------+-----+
I'm new to python & XML parsing. Hence, kindly, direct me to solve this. ...
Upvotes: 1
Views: 1599
Reputation: 648
You can try the following code to get all data from nested XML File, but i think there is an elegant way to achieve the result:
import pandas as pd, numpy as np
import xml.etree.ElementTree as ET
xml_data = 'your xml data'
# Prepare for the list of variable to save XML data
date=[]
ticket=[]
value=[]
notenders=[]
tendertype=[]
tenderamt=[]
receipeno=[]
price=[]
qty=[]
# Parse the XML File to get the desired data
root = ET.fromstring(xml_data)
# Get header data from XML (date, ticket, value, notenders, tenderdetail)
date.append(root.find('date').text)
ticket.append(root.find('ticket').text)
value.append(root.find('value').text)
notenders.append(int(root.find('notenders').text))
nested_node0=root.findall('tenderdetail')
for child0 in nested_node0:
tendertype.append(int(child0.find('tendertype').text))
tenderamt.append(int(child0.find('tenderamt').text))
# Get all data under first item tag
nested_node1 = root.findall('item') #1
for child in nested_node1:
receipeno.append(int(child.find('receipeno').text))
price.append(int(child.find('price').text))
qty.append(int(child.find('qty').text))
# Get all data under first items tag
nested_node2 = child.findall('items') #2
for child2 in nested_node2:
# Get all data under second item tag
nested_node3 = child2.findall('item') #3
for child3 in nested_node3:
receipeno.append(int(child3.find('receipeno').text))
price.append(int(child3.find('price').text))
qty.append(int(child3.find('qty').text))
# Get all data under second items tag
nested_node4 = child3.findall('items') #4
for child4 in nested_node4:
# Get all data under third item tag
nested_node5 = child4.findall('item') #5
for child5 in nested_node5:
receipeno.append(int(child5.find('receipeno').text))
price.append(int(child5.find('price').text))
qty.append(int(child5.find('qty').text))
# Make the same length of every list of data with the max length
date.extend([np.nan]*(len(receipeno)-len(date)))
ticket.extend([np.nan]*(len(receipeno)-len(ticket)))
value.extend([np.nan]*(len(receipeno)-len(value)))
notenders.extend([np.nan]*(len(receipeno)-len(notenders)))
tendertype.extend([np.nan]*(len(receipeno)-len(tendertype)))
tenderamt.extend([np.nan]*(len(receipeno)-len(tenderamt)))
data={'date':date,
'ticket':ticket,
'value':value,
'notenders':notenders,
'tendertype':tendertype,
'tenderamt':tenderamt,
'receipeno': receipeno,
'price': price,
'qty':qty}
# Create DataFrame from data
df = pd.DataFrame(data)
df = df.fillna(method='ffill')
df
Output:
Hope this can help you.
Upvotes: 1
Reputation: 30981
Start from necessary imports:
import pandas as pd
import xml.etree.ElementTree as et
import re
Then, to remove leading zeroes from tags to be read, define the following function:
def stripLZ(src):
return re.sub(r'^0+(?=\d)', '', src)
To read the source file and its root element, execute:
tree = et.parse('transaction.xml')
root = tree.getroot()
To read tags from the root level (other than read from items), execute:
dt = root.find('date').text
tck = root.find('ticket').text
val = root.find('value').text
notend = stripLZ(root.find('notenders').text)
Two remaining tags are one level down, so start from reading their parent tag:
tdet = root.find('tenderdetail')
and read these tags from it:
tendtyp = stripLZ(tdet.find('tendertype').text)
tendamt = tdet.find('tenderamt').text
Note that I used stripLZ function here (it will be used a few times more).
Now there is time to create the result DataFrame:
df_cols = ['date', 'ticket', 'value', 'notenders',
'tendertype', 'tenderamt', 'receipeno', 'price', 'qty']
df = pd.DataFrame(columns = df_cols)
And the loading loop can be performed using iter method:
for it in root.iter('item'):
rcp = it.find('receipeno').text
prc = it.find('price').text
qty = stripLZ(it.find('qty').text)
df = df.append(pd.Series([dt, tck, val, notend,
tendtyp, tendamt, rcp, prc, qty],
index = df_cols), ignore_index=True)
This loop:
Upvotes: 1