Chris

Reputation: 25

How do I remove a parent node in an XML file based on text in a child element? Using Python 3.6

Right now I'm attempting to use lxml in Python 3.6. I want to remove each "Program" that contains 'hedge', and remove the "Request" altogether if none of its Programs contain 'keep'. The XML is structured like this:

<Requests>
    <Request>
        <ProgramSelection>
            <Program> <![CDATA[hedge]]> </Program>
            <Program> <![CDATA[keep]]> </Program>
        </ProgramSelection>
    </Request>
</Requests>

import lxml.etree


file_name = r'C:filename.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()

for elem in tree.xpath("./Request[ProgramSelection/Program='hedge']"):
    root.remove(elem)

Upvotes: 1

Views: 571

Answers (2)

Parfait

Reputation: 107642

Since you are using the lxml module, consider XSLT, the special-purpose language designed to transform XML files. With this approach, no for loops or if logic is required. Plus, XSLT is portable, so the same stylesheet can be run well beyond Python.

The following script runs the identity transform to copy the document as is, then applies two empty templates that match the nodes to remove, so those nodes and their content are dropped from the output.

XSLT (save as .xsl file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
      <xsl:value-of select='normalize-space()'/>
  </xsl:template>

  <xsl:template match="Program[contains(text(),'hedge')]"/>
  <xsl:template match="Request[not(contains(., 'keep'))]"/>

</xsl:stylesheet>

Python

import lxml.etree as et

doc = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

transform = et.XSLT(xsl)    
result = transform(doc)

# OUTPUT TO SCREEN
print(result)

# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Output

<?xml version="1.0"?>
<Requests>
  <Request>
    <ProgramSelection>
      <Program>keep</Program>
    </ProgramSelection>
  </Request>
</Requests>
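
To try the transform without creating any files, the stylesheet and a sample document can be held in memory. This is only a sketch: the helper name filter_requests and the second, hedge-only Request in the sample are invented for illustration, and the Program template here matches on "." (the element's whole string value) instead of text(), so the test does not depend on how the parser splits text nodes.

```python
import lxml.etree as et

# Sample input modelled on the question. The second <Request> is a
# made-up case with no 'keep' program, so it should disappear entirely.
XML = b"""<Requests>
  <Request>
    <ProgramSelection>
      <Program> <![CDATA[hedge]]> </Program>
      <Program> <![CDATA[keep]]> </Program>
    </ProgramSelection>
  </Request>
  <Request>
    <ProgramSelection>
      <Program> <![CDATA[hedge]]> </Program>
    </ProgramSelection>
  </Request>
</Requests>"""

XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- identity transform: copy everything unless a later template matches -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- trim surrounding whitespace from text content -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- empty templates: matched nodes produce no output -->
  <xsl:template match="Program[contains(., 'hedge')]"/>
  <xsl:template match="Request[not(contains(., 'keep'))]"/>
</xsl:stylesheet>"""


def filter_requests(xml_bytes):
    """Apply the stylesheet to an XML byte string and return the result tree."""
    transform = et.XSLT(et.fromstring(XSL))
    return transform(et.fromstring(xml_bytes))


result = filter_requests(XML)
programs = [p.text for p in result.getroot().iter("Program")]
print(programs)
print(str(result))
```

Holding the stylesheet as a string keeps the example self-contained; for real use, a separate .xsl file as shown above is easier to maintain and reuse from other tools.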


Upvotes: 1

tdelaney

Reputation: 77347

You are close. The following two XPath expressions select the elements that match your removal criteria:

import lxml.etree

file_name = r'test.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()

# remove <Request> lacking a <Program>keep</Program>
# ('keep' must be quoted: a bare keep is read as a child element name;
#  '.' is the element's full string value, including CDATA content)
for request in tree.xpath(
        "Request[not(ProgramSelection/Program[contains(., 'keep')])]"):
    request.getparent().remove(request)

# remove <Program>hedge</Program>
for program in tree.xpath(
        "Request/ProgramSelection/Program[contains(., 'hedge')]"):
    program.getparent().remove(program)

print(lxml.etree.tostring(tree, pretty_print=True).decode())
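
To check the two removals without a file on disk, they can be wrapped in a function and run against an inline document. This is a sketch under assumptions: the function name prune_requests and the sample XML (including its second, hedge-only Request) are made up for illustration, and the predicates compare against "." (the element's string value) with quoted literals so that they also match text held in preserved CDATA sections.

```python
import lxml.etree


def prune_requests(root):
    """Remove Requests with no 'keep' Program, then remove 'hedge' Programs."""
    # xpath() returns a plain list, so removing elements while looping is safe.
    for request in root.xpath(
            "Request[not(ProgramSelection/Program[contains(., 'keep')])]"):
        request.getparent().remove(request)
    for program in root.xpath(
            "Request/ProgramSelection/Program[contains(., 'hedge')]"):
        program.getparent().remove(program)
    return root


# Hypothetical sample: the first Request keeps one Program,
# the second has no 'keep' Program and should be removed whole.
SAMPLE = b"""<Requests>
  <Request>
    <ProgramSelection>
      <Program> <![CDATA[hedge]]> </Program>
      <Program> <![CDATA[keep]]> </Program>
    </ProgramSelection>
  </Request>
  <Request>
    <ProgramSelection>
      <Program> <![CDATA[hedge]]> </Program>
    </ProgramSelection>
  </Request>
</Requests>"""

parser = lxml.etree.XMLParser(strip_cdata=False)
root = prune_requests(lxml.etree.fromstring(SAMPLE, parser))

# normalize-space(.) reads the text regardless of CDATA or padding.
remaining = [p.xpath("normalize-space(.)") for p in root.iter("Program")]
print(len(root.findall("Request")), remaining)
```

Removing the Requests first means the second loop only visits Programs that will survive in the output.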

You can also combine them into a single, slightly less readable union ("|") expression:

import lxml.etree

file_name = r'test.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()

# remove <Request> lacking a <Program>keep</Program>
# remove <Program>hedge</Program>
# (adjacent string literals are joined by Python into one XPath expression)
for elem in tree.xpath(
        "Request[not(ProgramSelection/Program[contains(., 'keep')])]"
        " | "
        "Request/ProgramSelection/Program[contains(., 'hedge')]"):
    elem.getparent().remove(elem)

print(lxml.etree.tostring(tree, pretty_print=True).decode())

Upvotes: 1
