fascynacja

Reputation: 2888

How to process and update (change an attribute, add a node, etc.) an XML file with a DOCTYPE in Python without removing or altering the DOCTYPE

I have several XML files whose nodes and attributes I would like to process and update. I have a few example scripts that can do this, but all of them slightly change the XML structure, or remove or alter the DOCTYPE. A simplified example of the XML is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note  xmlns:note="http://example.com/note">
  <to checksum="abc">Tove</to> 
</note:note>

The DTD, note.dtd, is:

<!ELEMENT note:note (to)>
<!ELEMENT to (#PCDATA)>
 <!ATTLIST to
    checksum CDATA #REQUIRED
>

An example Python script that updates an attribute value:

    @staticmethod
    def replace_checksum_in_index_xml(infile, checksum_new, outfile):
        from lxml import etree
        parser = etree.XMLParser(remove_blank_text=True)
        with open(infile, "rb") as f:
            tree = etree.parse(f, parser)

        for elem in tree.xpath("//to[@checksum]"):
            elem.set("checksum", checksum_new)

        with open(outfile, "wb") as f:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)

I call the script like this:

    infile = "Input.xml"
    check_sum = "aaabbb"
    outfile = "Output.xml"
    Hashes.replace_checksum_in_index_xml(infile, check_sum, outfile)

And the resulting XML file is:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>

The DOCTYPE in the output has changed: instead of "DOCTYPE note:note" there is "DOCTYPE note". I would like to keep the DOCTYPE exactly as it was. How can I achieve this in Python?

Upvotes: 0

Views: 93

Answers (2)

Yitzhak Khabinsky

Reputation: 22293

Please try the following solution based on XSLT.

When working with XML, it is better to use its native APIs (XPath, XSLT, XQuery, XSD, etc.), which can be called from any general-purpose programming language: Python, Java, C#, C++, etc.

The XSLT below uses the so-called identity transform pattern. It copies the input XML document as-is to the output, except for what is overridden by additional XSLT templates. In our case that is the <xsl:template match="@checksum"> template, which replaces the value of the @checksum attribute. The DOCTYPE itself is not copied by the transform: it is emitted because of the doctype-system attribute on <xsl:output>, and the name after <!DOCTYPE is taken from the root element of the result, which is why the output below shows note:note.

If multiple modifications to the input XML file are needed, additional XSLT templates can handle them in one pass within the same XSLT file.

Additionally, you can pass as many parameters as needed from Python to the XSLT. The Python code uses a dictionary for that.

Input XML

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
    <to checksum="abc">Tove</to>
</note:note>

XSLT 1.0

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:note="http://example.com/note">
    <xsl:output omit-xml-declaration="no" encoding="utf-8"
                indent="yes" doctype-system="note.dtd"/>
    <xsl:strip-space elements="*"/>

    <xsl:param name="newValue" select="'wowString'"/>

    <!--identity transform-->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@checksum">
        <xsl:attribute name="checksum">
            <xsl:value-of select="$newValue"/>
        </xsl:attribute>
    </xsl:template>
</xsl:stylesheet>

Output XML

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="magicString">Tove</to>
</note:note>

Python

import lxml.etree as et

inputfile = "e:\\Temp\\Identity Transformation126\\Input_2.xml"
xsltfile = "e:\\Temp\\Identity Transformation126\\Process_2.xslt"
outfile = "e:\\Temp\\Identity Transformation126\\Output_2.xml"

# PARAMETERS
argDict = {}
argDict["newValue"] = et.XSLT.strparam("magicString")

# XSLT TRANSFORMATION
transform = et.XSLT(et.parse(xsltfile))
result = transform(et.parse(inputfile), **argDict)

# OUTPUT TO FILE
with open(outfile, "wb") as f:
    f.write(result)

Upvotes: 1

Hermann12

Reputation: 3581

I don't think stripping the prefix from the DOCTYPE is an error. If you really want to keep the prefix, you can write it explicitly:

from lxml import etree

def read_doctype_name(filename):
    with open(filename, "r", encoding="UTF-8") as f:
        for line in f:
            if line.startswith('<!DOCTYPE'):
                return line
    
def replace_checksum_in_index_xml(infile, checksum_new, outfile, docName):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)

    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)

    with open(outfile, "wb") as f:
        if docName is not None:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=docName.strip())
        else:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8")
        
if __name__ == "__main__":
    # Read the original DOCTYPE line (None if the file has no DOCTYPE line).
    doctyp = read_doctype_name("infile.xml")
    replace_checksum_in_index_xml("infile.xml", "aaabbb", "outfile.xml", doctyp)
    print("finish")

Output file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>
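
Another way to get the same result, sketched here under assumptions, is to let the serializer write whatever DOCTYPE it produces and then put the original declaration back into the serialized text. The helper name restore_doctype and the DOCTYPE_RE pattern are hypothetical and not part of lxml; the sketch uses only the standard library and assumes a DOCTYPE without an internal subset ([...]).

```python
import re

# Assumption: a simple DOCTYPE declaration with no internal subset ([...]).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>")


def restore_doctype(original_xml: str, rewritten_xml: str) -> str:
    """Return rewritten_xml with its DOCTYPE replaced by the one from original_xml."""
    original = DOCTYPE_RE.search(original_xml)
    if original is None:
        # The source had no (simple) DOCTYPE, so there is nothing to restore.
        return rewritten_xml
    # A function is used as the replacement so that backslashes in the
    # DOCTYPE text are not interpreted as regex escapes; count=1 limits
    # the substitution to the first declaration found.
    return DOCTYPE_RE.sub(lambda _m: original.group(0), rewritten_xml, count=1)
```

It would be called with the text of the input file and the text of the file written by tree.write(). If the rewritten text contains no DOCTYPE at all, it is returned unchanged.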

Alternatively, you can use a regular expression to extract the DOCTYPE string:

import re

def read_doctype_name(filename):
    with open(filename, "r", encoding="UTF-8") as f:
        xml_ = f.read()
    # Search once and reuse the match; return None when there is no DOCTYPE.
    doctype_match = re.search(r'<!DOCTYPE[^>]*>', xml_)
    return doctype_match[0] if doctype_match is not None else None
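
One limitation of the pattern above: it stops at the first ">", so a DOCTYPE with an internal subset (declarations inside [...]) would be cut short. The following is a minimal sketch of a more tolerant pattern. The names find_doctype and DOCTYPE_WITH_SUBSET_RE are hypothetical, and the pattern assumes that no "]" appears inside the subset itself (for example in a quoted default value); a real XML parser is the only fully reliable option.

```python
import re

# Assumption: the internal subset, if present, contains no "]" of its own.
# re.DOTALL lets "." match newlines, so a multi-line subset is matched too.
DOCTYPE_WITH_SUBSET_RE = re.compile(
    r"<!DOCTYPE\s[^\[>]*(?:\[.*?\]\s*)?>",
    re.DOTALL,
)


def find_doctype(xml_text):
    """Return the full DOCTYPE declaration found in xml_text, or None."""
    match = DOCTYPE_WITH_SUBSET_RE.search(xml_text)
    return match.group(0) if match else None
```

The returned string can be passed as the doctype argument of tree.write() in the same way as the result of read_doctype_name.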

Upvotes: 1
