Reputation: 2298
I'm trying to extract some information from an XML file and tabulate it in the way is shown in image below. My code below is almost working, but I'm stuck in how to fix the repeated values (rows of data) that appear in print out. Below I show the current output, desired output, the code I have so far and xml input file. Thanks for any help.
row.extend(...)
in the code.Python code
import xml.etree.ElementTree as ET
import glob, os
import pandas as pd
filename = 'input.xml'
root = ET.parse(filename).getroot()
rows = []
columns = ['TC', 'LPP','LRP','LKTT','WQNT','ZZ','DFT','YY_TRWI','ZZ_TRWE']
for element in root.findall('.//NL/NW'):
tc = element.find('.//TC').text
llp = element.find('.//LPP').text
lrp = element.find('.//LRP').text
lktt = element.find('.//LKTT').text
wqnt = element.find('.//WQNT').text
for entry in root.findall('.//CEN4/GNR/RD/NR'):
row = []
zz = entry.find('.//ZZ').text
dft = entry.find('.//DFT').text
yy_trwi = getattr(entry.find('./SR/YY_TRWI'), 'text', None)
zz_trwe = getattr(entry.find('./SR/ZZ_TRWE'), 'text', None)
row.extend([tc,llp,lrp,lktt,wqnt,zz,dft,yy_trwi,zz_trwe])
rows.append(row)
df = pd.DataFrame(rows, columns=columns)
df
XML file
<?xml version="1.0" encoding="UTF-8"?>
<vol-trf>
<MAINXZ123FileHeader>
<FT>XZ43</FT>
<SNDTDG>AURUYYY</SNDTDG>
</MAINXZ123FileHeader>
<ON>
<ORN>XYZ</ORN>
<CI>AUR</CI>
<NL>
<NW>
<TC>WEKP71</TC>
<ND>
<RIS>
<RI>
<CEN4>
<GNR>
<RD>
<NR>
<ZZ>403</ZZ>
<DFT>676</DFT>
</NR>
</RD>
<RD>
<NR>
<ZZ>419</ZZ>
<DFT>170</DFT>
<SR>
<YY_TRWI>4001</YY_TRWI>
<ZZ_TRWE>4015</ZZ_TRWE>
</SR>
</NR>
</RD>
<RD>
<NR>
<ZZ>429</ZZ>
<DFT>170</DFT>
<SR>
<YY_TRWI>5574</YY_TRWI>
<ZZ_TRWE>7614</ZZ_TRWE>
</SR>
</NR>
</RD>
</GNR>
</CEN4>
<CEN2>
<LPP>2079</LPP>
<LRP>531</LRP>
</CEN2>
<CEM>
<LKTT>4113</LKTT>
<WQNT>5106</WQNT>
</CEM>
</RI>
</RIS>
</ND>
</NW>
<NW>
<TC>AURX45</TC>
<ND>
<RIS>
<RI>
<CEN4>
<GNR>
<RD>
<NR>
<ZZ>31</ZZ>
<DFT>455</DFT>
</NR>
</RD>
<RD>
<NR>
<ZZ>31</ZZ>
<DFT>170</DFT>
<SR>
<YY_TRWI>123</YY_TRWI>
<ZZ_TRWE>883</ZZ_TRWE>
</SR>
</NR>
</RD>
<RD>
<NR>
<ZZ>33</ZZ>
<DFT>631</DFT>
<SR>
<YY_TRWI>324</YY_TRWI>
<ZZ_TRWE>545</ZZ_TRWE>
</SR>
</NR>
</RD>
</GNR>
</CEN4>
<CEN2>
<LPP>711</LPP>
<LRP>87</LRP>
</CEN2>
<CEM>
<LKTT>319</LKTT>
<WQNT>433</WQNT>
</CEM>
</RI>
</RIS>
</ND>
</NW>
</NL>
</ON>
</vol-trf>
Upvotes: 0
Views: 54
Reputation: 12777
This is returning as expected (some clean up is needed but values are correct).
All Xpath expression make use of ancestor or sibling values or position to make a more precise query. Xpath positions starts at 1, not 0.
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd
filename = '/home/lmc/tmp/temp2.xml'
root = etree.parse(filename)
rows = []
columns = ['TC','LPP','LRP','LKTT','WQNT','ZZ','DFT','YY_TRWI','ZZ_TRWE']
i=0
for element in root.xpath('//NL/NW/TC'):
llp = element.xpath('//LPP/text()')
lrp = element.xpath('//LRP')
lktt = element.xpath('//LKTT')
wqnt = element.xpath('//WQNT')
y=0
xp_base='//NW[TC[.="{}"]]'.format(element.text)
xp='{}/descendant::CEN4/GNR/RD/NR'.format(xp_base)
print(xp)
for entry in element.xpath(xp):
row = []
xp_NR = '{}/descendant::NR[{}]'.format(xp_base, y+1)
zz = entry.xpath('{}/descendant::ZZ/text()'.format(xp_NR))
dft = entry.xpath('{}/descendant::DFT/text()'.format(xp_NR))
xp = '{}/descendant::ZZ[.="{}"]/following-sibling::SR/YY_TRWI/text()'.format(xp_NR,zz[0])
print(xp)
yy_trwi = root.xpath(xp)
xp = '{}/descendant::ZZ[.="{}"]/following-sibling::SR/ZZ_TRWE/text()'.format(xp_NR, zz[0])
print(xp)
zz_trwe = entry.xpath(xp)
row.extend([element.text,llp[i],lrp[i].text,lktt[i].text,wqnt[i].text, zz, dft, yy_trwi, zz_trwe ])
rows.append(row)
y += 1
i += 1
df = pd.DataFrame(rows, columns=columns)
print(df)
Result:
TC LPP LRP LKTT WQNT ZZ DFT YY_TRWI ZZ_TRWE
0 WEKP71 2079 531 4113 5106 [403] [676] [] []
1 WEKP71 2079 531 4113 5106 [419] [170] [4001] [4015]
2 WEKP71 2079 531 4113 5106 [429] [170] [5574] [7614]
3 AURX45 711 87 319 433 [31] [455] [] []
4 AURX45 711 87 319 433 [31] [170] [123] [883]
5 AURX45 711 87 319 433 [33] [631] [324] [545]
Printing some xpath expression (xp)
//NW[TC[.="WEKP71"]]/descendant::CEN4/GNR/RD/NR
//NW[TC[.="WEKP71"]]/descendant::NR[1]/descendant::ZZ[.="403"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[1]/descendant::ZZ[.="403"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="WEKP71"]]/descendant::NR[2]/descendant::ZZ[.="419"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[2]/descendant::ZZ[.="419"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="WEKP71"]]/descendant::NR[3]/descendant::ZZ[.="429"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[3]/descendant::ZZ[.="429"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::CEN4/GNR/RD/NR
//NW[TC[.="AURX45"]]/descendant::NR[1]/descendant::ZZ[.="31"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[1]/descendant::ZZ[.="31"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::NR[2]/descendant::ZZ[.="31"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[2]/descendant::ZZ[.="31"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::NR[3]/descendant::ZZ[.="33"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[3]/descendant::ZZ[.="33"]/following-sibling::SR/ZZ_TRWE/text()
Upvotes: 1