Ger Cas
Ger Cas

Reputation: 2298

Remove repeated data extracted from XML

I'm trying to extract some information from an XML file and tabulate it in the way is shown in image below. My code below is almost working, but I'm stuck in how to fix the repeated values (rows of data) that appear in print out. Below I show the current output, desired output, the code I have so far and xml input file. Thanks for any help.

enter image description here

Python code

import xml.etree.ElementTree as ET
import glob, os
import pandas as pd

filename = 'input.xml'  
root = ET.parse(filename).getroot()

rows = []
columns = ['TC', 'LPP','LRP','LKTT','WQNT','ZZ','DFT','YY_TRWI','ZZ_TRWE']

for element in root.findall('.//NL/NW'):
    tc = element.find('.//TC').text
    llp = element.find('.//LPP').text
    lrp = element.find('.//LRP').text
    lktt = element.find('.//LKTT').text
    wqnt = element.find('.//WQNT').text
    
    for entry in root.findall('.//CEN4/GNR/RD/NR'):
        row = []
        zz = entry.find('.//ZZ').text
        dft = entry.find('.//DFT').text
        
        yy_trwi = getattr(entry.find('./SR/YY_TRWI'), 'text', None)
        zz_trwe = getattr(entry.find('./SR/ZZ_TRWE'), 'text', None)
        row.extend([tc,llp,lrp,lktt,wqnt,zz,dft,yy_trwi,zz_trwe])
        rows.append(row)

df = pd.DataFrame(rows, columns=columns)
df

XML file

<?xml version="1.0" encoding="UTF-8"?>
<vol-trf>
    <MAINXZ123FileHeader>
        <FT>XZ43</FT>
        <SNDTDG>AURUYYY</SNDTDG>
    </MAINXZ123FileHeader>
    <ON>
        <ORN>XYZ</ORN>
        <CI>AUR</CI>
        <NL>
            <NW>
                <TC>WEKP71</TC>
                <ND>
                    <RIS>
                        <RI>
                            <CEN4>
                                <GNR>
                                    <RD>
                                        <NR>
                                            <ZZ>403</ZZ>
                                            <DFT>676</DFT>
                                        </NR>
                                    </RD>
                                    <RD>
                                        <NR>
                                            <ZZ>419</ZZ>
                                            <DFT>170</DFT>
                                            <SR>
                                                <YY_TRWI>4001</YY_TRWI>
                                                <ZZ_TRWE>4015</ZZ_TRWE>
                                            </SR>
                                        </NR>
                                    </RD>
                                    <RD>
                                        <NR>
                                            <ZZ>429</ZZ>
                                            <DFT>170</DFT>
                                            <SR>
                                                <YY_TRWI>5574</YY_TRWI>
                                                <ZZ_TRWE>7614</ZZ_TRWE>
                                            </SR>
                                        </NR>
                                    </RD>
                                </GNR>
                            </CEN4>
                            <CEN2>
                                <LPP>2079</LPP>
                                <LRP>531</LRP>
                            </CEN2>
                            <CEM>
                                <LKTT>4113</LKTT>
                                <WQNT>5106</WQNT>
                            </CEM>
                        </RI>
                    </RIS>
                </ND>
            </NW>
            <NW>
                <TC>AURX45</TC>
                <ND>
                    <RIS>
                        <RI>
                            <CEN4>
                                <GNR>
                                    <RD>
                                        <NR>
                                            <ZZ>31</ZZ>
                                            <DFT>455</DFT>
                                        </NR>
                                    </RD>
                                    <RD>
                                        <NR>
                                            <ZZ>31</ZZ>
                                            <DFT>170</DFT>
                                            <SR>
                                                <YY_TRWI>123</YY_TRWI>
                                                <ZZ_TRWE>883</ZZ_TRWE>
                                            </SR>
                                        </NR>
                                    </RD>
                                    <RD>
                                        <NR>
                                            <ZZ>33</ZZ>
                                            <DFT>631</DFT>
                                            <SR>
                                                <YY_TRWI>324</YY_TRWI>
                                                <ZZ_TRWE>545</ZZ_TRWE>
                                            </SR>
                                        </NR>
                                    </RD>
                                </GNR>
                            </CEN4>
                            <CEN2>
                                <LPP>711</LPP>
                                <LRP>87</LRP>
                            </CEN2>
                            <CEM>
                                <LKTT>319</LKTT>
                                <WQNT>433</WQNT>
                            </CEM>
                        </RI>
                    </RIS>
                </ND>
            </NW>
        </NL>
    </ON>
</vol-trf>

Upvotes: 0

Views: 54

Answers (1)

LMC
LMC

Reputation: 12777

This is returning as expected (some clean up is needed but values are correct).
All Xpath expression make use of ancestor or sibling values or position to make a more precise query. Xpath positions starts at 1, not 0.

import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd

filename = '/home/lmc/tmp/temp2.xml'
root = etree.parse(filename)

rows = []
columns = ['TC','LPP','LRP','LKTT','WQNT','ZZ','DFT','YY_TRWI','ZZ_TRWE']

i=0
for element in root.xpath('//NL/NW/TC'):

    llp = element.xpath('//LPP/text()')
    lrp = element.xpath('//LRP')
    lktt = element.xpath('//LKTT')
    wqnt = element.xpath('//WQNT')
    y=0
    xp_base='//NW[TC[.="{}"]]'.format(element.text)
    xp='{}/descendant::CEN4/GNR/RD/NR'.format(xp_base)
    print(xp)
    for entry in element.xpath(xp):
        row = []
        xp_NR = '{}/descendant::NR[{}]'.format(xp_base, y+1)
        
        zz = entry.xpath('{}/descendant::ZZ/text()'.format(xp_NR))
        dft = entry.xpath('{}/descendant::DFT/text()'.format(xp_NR))
        
        
        xp = '{}/descendant::ZZ[.="{}"]/following-sibling::SR/YY_TRWI/text()'.format(xp_NR,zz[0])
        print(xp)
        yy_trwi = root.xpath(xp)
        xp = '{}/descendant::ZZ[.="{}"]/following-sibling::SR/ZZ_TRWE/text()'.format(xp_NR, zz[0])
        print(xp)
        zz_trwe = entry.xpath(xp)
        row.extend([element.text,llp[i],lrp[i].text,lktt[i].text,wqnt[i].text, zz, dft, yy_trwi, zz_trwe ])
        rows.append(row)
        y += 1
    
    i += 1

df = pd.DataFrame(rows, columns=columns)
print(df)

Result:

       TC   LPP  LRP  LKTT  WQNT     ZZ    DFT YY_TRWI ZZ_TRWE
0  WEKP71  2079  531  4113  5106  [403]  [676]      []      []
1  WEKP71  2079  531  4113  5106  [419]  [170]  [4001]  [4015]
2  WEKP71  2079  531  4113  5106  [429]  [170]  [5574]  [7614]
3  AURX45   711   87   319   433   [31]  [455]      []      []
4  AURX45   711   87   319   433   [31]  [170]   [123]   [883]
5  AURX45   711   87   319   433   [33]  [631]   [324]   [545]

Printing some xpath expression (xp)

//NW[TC[.="WEKP71"]]/descendant::CEN4/GNR/RD/NR
//NW[TC[.="WEKP71"]]/descendant::NR[1]/descendant::ZZ[.="403"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[1]/descendant::ZZ[.="403"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="WEKP71"]]/descendant::NR[2]/descendant::ZZ[.="419"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[2]/descendant::ZZ[.="419"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="WEKP71"]]/descendant::NR[3]/descendant::ZZ[.="429"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="WEKP71"]]/descendant::NR[3]/descendant::ZZ[.="429"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::CEN4/GNR/RD/NR
//NW[TC[.="AURX45"]]/descendant::NR[1]/descendant::ZZ[.="31"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[1]/descendant::ZZ[.="31"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::NR[2]/descendant::ZZ[.="31"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[2]/descendant::ZZ[.="31"]/following-sibling::SR/ZZ_TRWE/text()
//NW[TC[.="AURX45"]]/descendant::NR[3]/descendant::ZZ[.="33"]/following-sibling::SR/YY_TRWI/text()
//NW[TC[.="AURX45"]]/descendant::NR[3]/descendant::ZZ[.="33"]/following-sibling::SR/ZZ_TRWE/text()

Upvotes: 1

Related Questions