Reputation: 25
Right now I'm attempting to use lxml in Python 3.6. I want to remove each "Program" that contains 'hedge', and remove the "Request" altogether if none of its Programs contain 'keep'. The XML is structured like this:
<Requests>
  <Request>
    <ProgramSelection>
      <Program> <![CDATA[hedge]]> </Program>
      <Program> <![CDATA[keep]]> </Program>
    </ProgramSelection>
  </Request>
</Requests>
import lxml.etree
file_name = r'C:filename.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
for elem in tree.xpath("./Request[ProgramSelection/Program='hedge']"):
    root.remove(elem)
Upvotes: 1
Views: 571
Reputation: 107642
Since you use the lxml module, consider XSLT, the special-purpose language designed to transform XML files. With this approach, no for loops or if logic is required. Plus, XSLT is portable, so it can be run well beyond Python.
The following script runs the identity transform to copy the document as is, then uses two empty templates to drop the nodes that match the removal criteria.
XSLT (save as .xsl file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:value-of select='normalize-space()'/>
  </xsl:template>
  <xsl:template match="Program[contains(text(),'hedge')]"/>
  <xsl:template match="Request[not(contains(., 'keep'))]"/>
</xsl:stylesheet>
Python
import lxml.etree as et
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
transform = et.XSLT(xsl)
result = transform(doc)
# OUTPUT TO SCREEN
print(result)
# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Output
<?xml version="1.0"?>
<Requests>
  <Request>
    <ProgramSelection>
      <Program>keep</Program>
    </ProgramSelection>
  </Request>
</Requests>
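To try the stylesheet without creating files, here is a minimal sketch that keeps both the XSLT and a sample document in memory. The names `XSLT_SOURCE`, `SAMPLE` and `programs` are illustrative only, and the sample adds a second Request (an assumption, not from the question) so that the Request-level rule has something to remove.

```python
import lxml.etree as et

# Same templates as the stylesheet above, held in a byte string.
XSLT_SOURCE = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>
  <xsl:template match="Program[contains(text(),'hedge')]"/>
  <xsl:template match="Request[not(contains(., 'keep'))]"/>
</xsl:stylesheet>"""

# Assumed sample: the first Request has a 'keep' Program, the second does not.
SAMPLE = b"""<Requests>
  <Request>
    <ProgramSelection>
      <Program><![CDATA[hedge]]></Program>
      <Program><![CDATA[keep]]></Program>
    </ProgramSelection>
  </Request>
  <Request>
    <ProgramSelection>
      <Program><![CDATA[hedge]]></Program>
    </ProgramSelection>
  </Request>
</Requests>"""

# Compile the stylesheet and apply it to the parsed sample.
transform = et.XSLT(et.fromstring(XSLT_SOURCE))
result = transform(et.fromstring(SAMPLE))

# Collect what survived the transform.
programs = [p.text for p in result.getroot().iter("Program")]
print(len(result.getroot()), programs)
```

Building the transform from strings like this is mainly convenient for quick experiments; for real use, keeping the stylesheet in its own .xsl file as shown above is probably easier to maintain.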
Upvotes: 1
Reputation: 77347
You are close. The following two XPath expressions select the elements matching your removal criteria:
import lxml.etree
file_name = r'test.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
# remove <Request> lacking a <Program>keep</Program>
# ("." is the element's full string value, so the match does not depend
# on how the text and CDATA nodes inside <Program> are split)
for request in tree.xpath(
        "Request[not(ProgramSelection/Program[contains(., 'keep')])]"):
    request.getparent().remove(request)
# remove <Program>hedge</Program>
for program in tree.xpath(
        "Request/ProgramSelection/Program[contains(., 'hedge')]"):
    program.getparent().remove(program)
print(lxml.etree.tostring(tree, pretty_print=True).decode())
You can also combine them into a single, somewhat less readable expression with the XPath union operator "|":
import lxml.etree
file_name = r'test.xml'
parser = lxml.etree.XMLParser(strip_cdata=False)
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
# remove <Request> lacking a <Program>keep</Program>
# remove <Program>hedge</Program>
# (adjacent string literals are joined into one XPath expression)
for elem in tree.xpath(
        "Request[not(ProgramSelection/Program[contains(., 'keep')])]"
        "|"
        "Request/ProgramSelection/Program[contains(., 'hedge')]"):
    elem.getparent().remove(elem)
print(lxml.etree.tostring(tree, pretty_print=True).decode())
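To check the two rules without a file on disk, here is a minimal sketch that applies them to an in-memory document. The helper `prune` and the `SAMPLE` data are hypothetical names for illustration; the second Request is an added assumption so that the Request-level rule is exercised. The predicates use `contains(., ...)`, which tests the whole string value of each Program.

```python
import lxml.etree

# Assumed sample: the first Request has a 'keep' Program, the second does not.
SAMPLE = b"""<Requests>
  <Request>
    <ProgramSelection>
      <Program><![CDATA[hedge]]></Program>
      <Program><![CDATA[keep]]></Program>
    </ProgramSelection>
  </Request>
  <Request>
    <ProgramSelection>
      <Program><![CDATA[hedge]]></Program>
    </ProgramSelection>
  </Request>
</Requests>"""


def prune(xml_bytes):
    """Hypothetical helper: apply both removal rules to an XML byte string."""
    parser = lxml.etree.XMLParser(strip_cdata=False)
    root = lxml.etree.fromstring(xml_bytes, parser)
    # Rule 1: drop every <Request> with no <Program> containing 'keep'.
    for request in root.xpath(
            "Request[not(ProgramSelection/Program[contains(., 'keep')])]"):
        request.getparent().remove(request)
    # Rule 2: drop every remaining <Program> containing 'hedge'.
    for program in root.xpath(
            "Request/ProgramSelection/Program[contains(., 'hedge')]"):
        program.getparent().remove(program)
    return root


root = prune(SAMPLE)
remaining = [p.text for p in root.iter("Program")]
print(len(root), remaining)
print(lxml.etree.tostring(root, pretty_print=True).decode())
```

Each `xpath()` call returns a list that is fully built before the loop starts, so removing elements inside the loop does not disturb the iteration.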
Upvotes: 1