Reputation: 2888
I have a couple of XML files that I would like to process, updating some of their nodes and attributes. I have a few example scripts that can do that, but all of them slightly change the XML structure, or remove or alter the DOCTYPE. A simplified example of the XML is:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="abc">Tove</to>
</note:note>
The DTD note.dtd is:
<!ELEMENT note:note (to)>
<!ELEMENT to (#PCDATA)>
<!ATTLIST to
checksum CDATA #REQUIRED
>
An example Python script that updates the attribute value is:
@staticmethod
def replace_checksum_in_index_xml(infile, checksum_new, outfile):
    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)
    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)
    with open(outfile, "wb") as f:
        tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)
I call the script like this:
infile = "Input.xml"
check_sum = "aaabbb"
outfile = "Output.xml"
Hashes.replace_checksum_in_index_xml(infile, check_sum, outfile)
And the result xml file is:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="aaabbb">Tove</to>
</note:note>
The output DOCTYPE has changed and instead of
DOCTYPE note:note
there is
DOCTYPE note
I would like to keep the DOCTYPE as it was.
How can I achieve the desired result in Python?
Upvotes: 0
Views: 93
Reputation: 22293
Please try the following XSLT-based solution.
When working with XML, it is better to use its native APIs (XPath, XSLT, XQuery, XSD, etc.), called from a general-purpose programming language such as Python, Java, C#, or C++.
The XSLT below uses the so-called identity transform pattern. It copies the input XML document as-is to the output, except for what is specified in additional XSLT templates. In our case that is the <xsl:template match="@checksum"> template, which replaces the value of the @checksum attribute. The DOCTYPE is not a node that the identity transform copies; it is written out by the doctype-system attribute of <xsl:output>.
If multiple modifications to the input XML file are needed, additional XSLT templates can handle them in one pass in the same XSLT file.
Additionally, you can pass as many parameters as needed from Python to the XSLT. The Python code uses a dictionary for that.
Input XML
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="abc">Tove</to>
</note:note>
XSLT 1.0
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:note="http://example.com/note">
  <xsl:output omit-xml-declaration="no" encoding="utf-8"
              indent="yes" doctype-system="note.dtd"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="newValue" select="'wowString'"/>
  <!--identity transform-->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@checksum">
    <xsl:attribute name="checksum">
      <xsl:value-of select="$newValue"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
Output XML
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="magicString">Tove</to>
</note:note>
Python
import lxml.etree as et
inputfile = "e:\\Temp\\Identity Transformation126\\Input_2.xml"
xsltfile = "e:\\Temp\\Identity Transformation126\\Process_2.xslt"
outfile = "e:\\Temp\\Identity Transformation126\\Output_2.xml"
# PARAMETERS
argDict = {}
argDict["newValue"] = et.XSLT.strparam("magicString")
# XSLT TRANSFORMATION
transform = et.XSLT(et.parse(xsltfile))
result = transform(et.parse(inputfile), **argDict)
# OUTPUT TO FILE
with open(outfile, "wb") as f:
    f.write(result)
Upvotes: 1
Reputation: 3581
I don't think the stripping of the prefix in the DOCTYPE is an error. If you really want to keep the prefix, you can write it explicitly:
from lxml import etree

def read_doctype_name(filename):
    # Return the first line that starts with <!DOCTYPE, or None if there is none.
    # This only works when the whole declaration sits on a single line.
    with open(filename, "r", encoding="UTF-8") as f:
        for line in f:
            if line.startswith('<!DOCTYPE'):
                return line
    return None

def replace_checksum_in_index_xml(infile, checksum_new, outfile, docName):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)
    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)
    with open(outfile, "wb") as f:
        if docName is not None:
            # Pass the original DOCTYPE string through unchanged.
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=docName.strip())
        else:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8")

if __name__ == "__main__":
    doctyp = read_doctype_name("infile.xml")
    replace_checksum_in_index_xml("infile.xml", "aaabbb", "outfile.xml", doctyp)
    print("finish")
File:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="aaabbb">Tove</to>
</note:note>
Alternatively, you can use a regex function to extract the DOCTYPE string:
import re

def read_doctype_name(filename):
    with open(filename, "r", encoding="UTF-8") as f:
        xml_ = f.read()
    # Search once and reuse the match object.
    doctype_match = re.search(r'<!DOCTYPE[^>]*>', xml_)
    if doctype_match is not None:
        return doctype_match[0]
    return None
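The regex idea can also be tried on in-memory strings, which makes it easy to check against a DOCTYPE that spans several lines. The following is only a minimal sketch using the standard library: the names DOCTYPE_RE, extract_doctype and restore_doctype are hypothetical, and it assumes the DOCTYPE has no internal subset (no "[ ... ]" part). The second helper is a swapped-in variation: instead of passing the string to the serializer, it patches a DOCTYPE that has already been written.

```python
import re

# Assumption: the DOCTYPE has no internal subset ("[ ... ]"); such
# declarations are deliberately not matched by this pattern.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>")


def extract_doctype(xml_text):
    """Return the first DOCTYPE declaration in xml_text, or None."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


def restore_doctype(serialized_text, original_doctype):
    """Replace the first DOCTYPE in serialized_text with original_doctype."""
    if original_doctype is None:
        return serialized_text
    # A function replacement avoids backslash interpretation in the string.
    return DOCTYPE_RE.sub(lambda _m: original_doctype, serialized_text, count=1)


# Original document, with the DOCTYPE split over two lines.
source = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE note:note\n'
    '  SYSTEM "note.dtd">\n'
    '<note:note xmlns:note="http://example.com/note"/>\n'
)
# Serialized result in which the prefix was dropped from the DOCTYPE.
serialized = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    '<!DOCTYPE note SYSTEM "note.dtd">\n'
    '<note:note xmlns:note="http://example.com/note"/>\n'
)

doctype = extract_doctype(source)
fixed = restore_doctype(serialized, doctype)
print(fixed)
```

Because the character class also matches newlines, the multi-line declaration is returned whole, which the line-by-line reader above would miss.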
Upvotes: 1