Steffen Rheinhold
Steffen Rheinhold

Reputation: 23

Why does python ElementTree parse and output (tostring) newlines and spaces?

Given the following two sample XMLs:

indented.xml:

<RestInterface>
    <message id="9" timestamp="2022-10-30 20:54:27.493">
        <response objectType="Org" id="Outlet.1">
            <attr appId="APP1" name="timezone">
                <value>Europe/Berlin</value>
            </attr>
            <attr appId="APP2" name="some_name">
                <value>12345</value>
            </attr>
        </response>
    </message>
</RestInterface>

oneline.xml:

<RestInterface><message id="9" timestamp="2022-10-30 20:54:27.493"><response objectType="Org" id="Outlet.1"><attr appId="APP1" name="timezone"><value>Europe/Berlin</value></attr><attr appId="APP2" name="some_name"><value>12345</value></attr></response></message></RestInterface>

which contain exactly the same XML content, I'm getting two different results with ElementTree.tostring() :

Python code:

import xml.etree.ElementTree as ET
tree = ET.parse(filename)
root = tree.getroot()
s = ET.tostring(root)
print(s)

Output for filename='indented.xml':

b'<RestInterface>\n    <message id="9" timestamp="2022-10-30 20:54:27.493">\n        <response objectType="Org" id="Outlet.1">\n            <attr appId="APP1" name="timezone">\n                <value>Europe/Berlin</value>\n            </attr>\n            <attr appId="APP2" name="some_name">\n                <value>12345</value>\n            </attr>\n        </response>\n    </message>\n</RestInterface>'

Output for filename='oneline.xml':

b'<RestInterface><message id="9" timestamp="2022-10-30 20:54:27.493"><response objectType="Org" id="Outlet.1"><attr appId="APP1" name="timezone"><value>Europe/Berlin</value></attr><attr appId="APP2" name="some_name"><value>12345</value></attr></response></message></RestInterface>'

When I print the dump() output for each XML object, I get similar results: Both objects are being printed exactly how they are provided in the input XML files (newlines + indentation, versus single line).

Python version: 3.9.14

I was expecting the output to be the same for both files, as the XML had been parsed into an object and the ".tostring()" should create the output from the Python object's elements. But instead it adds the indentation and newlines from the input XML file. As the XMLParser of Elementtree uses the "expat" parser, I guess this is a problem with expat. But my programming skills are limited, so I can't drill deeper here.

Besides the fact, that this seems to be a bug and is pretty confusing - Did someone see the same problem? is there any known fix for this?

Upvotes: 0

Views: 764

Answers (1)

Jason
Jason

Reputation: 15931

According to the documentation - https://docs.python.org/3/library/xml.etree.elementtree.html - the method canonicalize includes a parameter strip_text to control whitespace handling.

strip_text: set to true to strip whitespace before and after text content

(default: false)

print(canonicalize(ET.tostring(root), strip_text = True))

Conversely, the function indent pretty prints an xml document

Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output. tree can be an Element or ElementTree. space is the whitespace string that will be inserted for each indentation level, two space characters by default. For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.

print(ET.indent(' ', 0))

Disclaimer - I'm pretty new to Python myself, you may have to play around with these functions a bit to get it working, but, hopefully they put you on the right path.

Upvotes: 2

Related Questions