Sir Muffington

Reputation: 321

How do you properly fetch from this nested XML?

I have the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <columns>
    <Leftover index="5">Leftover</Leftover>
    <NODE5 index="6"></NODE5>
    <NODE6 index="7"></NODE6>
    <NODE8 index="9"></NODE8>
    <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
    <Year index="8">2020</Year>
    <Name index="1">Name</Name>
    <Value_code index="3">Value code</Value_code>
  </columns>
  <records>
    <record index="1">
      <Leftover>Leftover</Leftover>
      <NODE5>Test1</NODE5>
      <NODE6>Test2</NODE6>
      <NODE8>Test3</NODE8>
      <Nomenk__Nr_></Nomenk__Nr_>
      <Name></Name>
      <Value_code></Value_code>
    </record>
  ... (it repeats itself with different values and the index value increments)

My code is:

import lxml
import lxml.etree as et
xml = open('C:\outputfile.xml', 'rb')
xml_content = xml.read()
tree = et.fromstring(xml_content)
for bad in tree.xpath("//records[@index=\'*\']/NODE5"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it
result = (et.tostring(tree, pretty_print=True, xml_declaration=True))
f = open( 'outputxml.xml', 'w' )
f.write( str(result) )
f.close()

What I need to do is remove NODE5, NODE6, and NODE8. I tried using a wildcard for the index attribute and then selecting one of the nodes (see line 6), but that does not seem to work. I'm also getting a syntax error on the first character right after the loop, yet the code executes.

Another problem is that lxml sets the encoding to ASCII when the file is "exported".

UPDATE: I am getting this error on line 8:

    return = ...
    ^
SyntaxError: invalid syntax

I took some code from https://stackoverflow.com/a/7981894/1987598

Upvotes: 0

Views: 45

Answers (1)

balderman

Reputation: 23815

What I need to do is to remove the NODE5, NODE6, NODE8.

See below. The code removes those nodes from the columns element and from every record:

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
   <columns>
      <Leftover index="5">Leftover</Leftover>
      <NODE5 index="6" />
      <NODE6 index="7" />
      <NODE8 index="9" />
      <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
      <Year index="8">2020</Year>
      <Name index="1">Name</Name>
      <Value_code index="3">Value code</Value_code>
   </columns>
   <records>
      <record index="1">
         <Leftover>Leftover</Leftover>
         <NODE5>Test1</NODE5>
         <NODE6>Test2</NODE6>
         <NODE8>Test3</NODE8>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>
      <record index="21">
         <Leftover>Leftover</Leftover>
         <NODE5>Test11</NODE5>
         <NODE6>Test21</NODE6>
         <NODE8>Test39</NODE8>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>      
   </records>
</data>'''

root = ET.fromstring(xml)

# Remove the unwanted nodes from the column definitions.
col = root.find('./columns')
for x in ['5', '6', '8']:
    for node in col.findall('./NODE{}'.format(x)):
        col.remove(node)

# Remove the same nodes from every record.
records = root.find('./records')
for r in records.findall('./record'):
    for x in ['5', '6', '8']:
        for node in r.findall('./NODE{}'.format(x)):
            r.remove(node)

ET.dump(root)

output

<data>
   <columns>
      <Leftover index="5">Leftover</Leftover>
      <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
      <Year index="8">2020</Year>
      <Name index="1">Name</Name>
      <Value_code index="3">Value code</Value_code>
   </columns>
   <records>
      <record index="1">
         <Leftover>Leftover</Leftover>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>
      <record index="21">
         <Leftover>Leftover</Leftover>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>      
   </records>
</data>
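
The question also asks about the ASCII encoding of the exported file, which the code above does not cover. Below is a minimal sketch of one way to handle both parts with the standard library. It assumes the real file has the same layout as the sample. The helper names (strip_unwanted, to_utf8_bytes), the shortened SAMPLE string, and the temp-directory output path are made up for illustration and are not part of the original code.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened sample modelled on the XML in the question (illustrative only).
SAMPLE = """<data>
  <columns>
    <Leftover index="5">Leftover</Leftover>
    <NODE5 index="6" />
    <Year index="8">2020</Year>
  </columns>
  <records>
    <record index="1">
      <Leftover>Leftover</Leftover>
      <NODE5>Test1</NODE5>
      <NODE6>Test2</NODE6>
      <NODE8>Test3</NODE8>
      <Name />
    </record>
  </records>
</data>"""

UNWANTED = {"NODE5", "NODE6", "NODE8"}


def strip_unwanted(root, tags=UNWANTED):
    """Remove every child whose tag is in `tags`, at any depth."""
    # Snapshot the elements first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
    return root


def to_utf8_bytes(root):
    """Serialize to bytes with an explicit UTF-8 XML declaration."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding="UTF-8", xml_declaration=True)
    return buffer.getvalue()


root = strip_unwanted(ET.fromstring(SAMPLE))
xml_bytes = to_utf8_bytes(root)

# Hypothetical output location; open in binary mode because we write bytes.
out_path = os.path.join(tempfile.gettempdir(), "outputxml.xml")
with open(out_path, "wb") as f:
    f.write(xml_bytes)
```

If you prefer to stay with lxml, the same ideas should carry over. The XPath in the question compares index against the literal string '*' on a records element, which has no index attribute, so it matches nothing; a path such as //record/NODE5 targets the actual element. lxml's tostring also accepts encoding="UTF-8", and its bytes result should be written to a file opened with 'wb' rather than passed through str(). The SyntaxError in the update comes from using return as a variable name, which Python does not allow because it is a reserved keyword.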

Upvotes: 1
