Reputation: 23
Given the following two sample XMLs:
indented.xml:
<RestInterface>
<message id="9" timestamp="2022-10-30 20:54:27.493">
<response objectType="Org" id="Outlet.1">
<attr appId="APP1" name="timezone">
<value>Europe/Berlin</value>
</attr>
<attr appId="APP2" name="some_name">
<value>12345</value>
</attr>
</response>
</message>
</RestInterface>
oneline.xml:
<RestInterface><message id="9" timestamp="2022-10-30 20:54:27.493"><response objectType="Org" id="Outlet.1"><attr appId="APP1" name="timezone"><value>Europe/Berlin</value></attr><attr appId="APP2" name="some_name"><value>12345</value></attr></response></message></RestInterface>
which contain exactly the same XML content, I'm getting two different results from ElementTree.tostring():
Python code:
import xml.etree.ElementTree as ET
tree = ET.parse(filename)
root = tree.getroot()
s = ET.tostring(root)
print(s)
Output for filename='indented.xml':
b'<RestInterface>\n <message id="9" timestamp="2022-10-30 20:54:27.493">\n <response objectType="Org" id="Outlet.1">\n <attr appId="APP1" name="timezone">\n <value>Europe/Berlin</value>\n </attr>\n <attr appId="APP2" name="some_name">\n <value>12345</value>\n </attr>\n </response>\n </message>\n</RestInterface>'
Output for filename='oneline.xml':
b'<RestInterface><message id="9" timestamp="2022-10-30 20:54:27.493"><response objectType="Org" id="Outlet.1"><attr appId="APP1" name="timezone"><value>Europe/Berlin</value></attr><attr appId="APP2" name="some_name"><value>12345</value></attr></response></message></RestInterface>'
When I print the dump() output for each XML object, I get similar results: both objects are printed exactly as they appear in the input XML files (newlines plus indentation, versus a single line).
Python version: 3.9.14
I expected the output to be the same for both files, since the XML has been parsed into an object and tostring() should generate its output from that object's elements. Instead, it reproduces the indentation and newlines from the input XML file. Since ElementTree's XMLParser uses the "expat" parser, I guess this is a problem with expat, but my programming skills are limited, so I can't dig deeper here.
Apart from the fact that this looks like a bug and is pretty confusing: has anyone seen the same problem? Is there a known fix for this?
Upvotes: 0
Views: 764
Reputation: 15931
According to the documentation (https://docs.python.org/3/library/xml.etree.elementtree.html), the function canonicalize() accepts a strip_text option to control whitespace handling:
strip_text: set to true to strip whitespace before and after text content
(default: false)
print(ET.canonicalize(ET.tostring(root, encoding="unicode"), strip_text=True))
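As a rough sketch of why the two files differ and how strip_text evens them out: ElementTree keeps the whitespace between tags in each element's text and tail attributes, so tostring() writes it back out unchanged. The shortened sample documents and the helper name normalized() below are made up for illustration. Note that canonicalize() produces C14N output, which may also reorder attributes.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for indented.xml and oneline.xml (illustrative only).
INDENTED = """<RestInterface>
  <message id="9">
    <response objectType="Org" id="Outlet.1">
      <attr appId="APP1" name="timezone">
        <value>Europe/Berlin</value>
      </attr>
    </response>
  </message>
</RestInterface>"""

ONELINE = (
    '<RestInterface><message id="9"><response objectType="Org" id="Outlet.1">'
    '<attr appId="APP1" name="timezone"><value>Europe/Berlin</value></attr>'
    '</response></message></RestInterface>'
)

# The indentation is stored on the parsed elements themselves.
root_text = ET.fromstring(INDENTED).text
print(repr(root_text))  # '\n  '


def normalized(xml_text):
    """Hypothetical helper: C14N output with surrounding whitespace stripped."""
    return ET.canonicalize(xml_text, strip_text=True)


canon_indented = normalized(INDENTED)
canon_oneline = normalized(ONELINE)
print(canon_indented == canon_oneline)  # True
```

With strip_text=True, whitespace-only text between elements is dropped, so both inputs serialize to the same string.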
Conversely, the function indent() pretty-prints an XML document:
Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output. tree can be an Element or ElementTree. space is the whitespace string that will be inserted for each indentation level, two space characters by default. For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.
ET.indent(root, space=' ', level=0)  # modifies the tree in place and returns None
print(ET.tostring(root))
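A minimal sketch of the opposite direction, assuming Python 3.9 or later (where indent() is available): running indent() on both trees overwrites the whitespace-only text between elements, so an irregularly indented input and a single-line input end up serialized identically. The tiny sample documents and the helper name pretty() are invented for this example.

```python
import xml.etree.ElementTree as ET

# Two small illustrative inputs with the same content but different layout.
indented_src = "<a>\n      <b>\n  <c>1</c>\n </b>\n</a>"
oneline_src = "<a><b><c>1</c></b></a>"


def pretty(xml_text):
    """Hypothetical helper: parse, re-indent in place, and serialize to str."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space="  ", level=0)  # returns None; the tree is modified
    return ET.tostring(root, encoding="unicode")


pretty_indented = pretty(indented_src)
pretty_oneline = pretty(oneline_src)

print(pretty_oneline)
print(pretty_indented == pretty_oneline)  # True
```

Because indent() changes the tree in place, the result has to be serialized separately with tostring() rather than printed directly.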
Disclaimer: I'm pretty new to Python myself, so you may have to play around with these functions a bit to get them working, but hopefully they put you on the right path.
Upvotes: 2