Why does python ElementTree parse and output (tostring) newlines and spaces?

Question

Given the following two sample XMLs:

indented.xml:


    
        
            
                Europe/Berlin
            
            
                12345

oneline.xml:

Europe/Berlin12345

which contain exactly the same XML content, I'm getting two different results with ElementTree.tostring() :

Python code:

import xml.etree.ElementTree as ET
tree = ET.parse(filename)
root = tree.getroot()
s = ET.tostring(root)
print(s)

Output for filename='indented.xml':

b'
    
        
            
                Europe/Berlin
            
            
                12345
            
        
    
'

Output for filename='oneline.xml':

b'Europe/Berlin12345'

When I print the dump() output for each XML object, I get similar results: Both objects are being printed exactly how they are provided in the input XML files (newlines + indentation, versus single line).

Python version: 3.9.14

I was expecting the output to be the same for both files, as the XML had been parsed into an object and tostring() should create the output from the Python object's elements. But instead it reproduces the indentation and newlines from the input XML file. As the XMLParser of ElementTree uses the "expat" parser, I guess this is a problem with expat. But my programming skills are limited, so I can't drill deeper here.

This looks like a bug to me and is pretty confusing. Has anyone seen the same problem? Is there a known fix for it?

Jason · Accepted Answer

According to the documentation (https://docs.python.org/3/library/xml.etree.elementtree.html), the function canonicalize() accepts an option strip_text to control whitespace handling.

strip_text: set to true to strip whitespace before and after text content

(default: false)

print(ET.canonicalize(ET.tostring(root), strip_text=True))
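
Here is a minimal self-contained sketch of the strip_text approach. The element names (config, timezone, zip) are hypothetical stand-ins, since the tags of the original sample files are not shown here. It assumes Python 3.8 or later, where canonicalize() is available.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; the element names are made up for illustration.
INDENTED = """<config>
    <timezone>Europe/Berlin</timezone>
    <zip>12345</zip>
</config>"""
ONELINE = "<config><timezone>Europe/Berlin</timezone><zip>12345</zip></config>"


def normalized(xml_text):
    """Parse, serialize, then canonicalize with surrounding whitespace stripped."""
    root = ET.fromstring(xml_text)
    # The parser keeps whitespace-only text in .text and .tail,
    # so tostring() alone reproduces the indentation of the input.
    raw = ET.tostring(root, encoding="unicode")
    return ET.canonicalize(raw, strip_text=True)


indented_out = normalized(INDENTED)
oneline_out = normalized(ONELINE)
print(indented_out)
print(indented_out == oneline_out)
```

Note that strip_text also trims leading and trailing whitespace around real text content, which may or may not be what you want.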

Conversely, the function indent() (added in Python 3.9) pretty-prints an XML document:

Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output. tree can be an Element or ElementTree. space is the whitespace string that will be inserted for each indentation level, two space characters by default. For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.

ET.indent(root, space='  ', level=0)  # modifies root in place, returns None
print(ET.tostring(root, encoding='unicode'))
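
Since indent() changes the tree in place and returns None, the tree has to be serialized afterwards to see the result. Here is a small sketch, again with hypothetical element names and assuming Python 3.9 or later:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line input; the element names are made up for illustration.
root = ET.fromstring(
    "<config><timezone>Europe/Berlin</timezone><zip>12345</zip></config>"
)

# indent() inserts whitespace into .text/.tail of the elements and returns None.
result = ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```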

Disclaimer - I'm pretty new to Python myself, you may have to play around with these functions a bit to get it working, but, hopefully they put you on the right path.
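
If the goal is for tostring() itself to give the same output for both files, another option is to remove the whitespace-only text from the parsed tree before serializing. This is a rough sketch rather than a built-in feature: strip_whitespace is a hypothetical helper name, and the sample element names are made up.

```python
import xml.etree.ElementTree as ET


def strip_whitespace(elem):
    """Drop whitespace-only .text and .tail on elem and its descendants, in place."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


# Hypothetical indented input; the element names are made up for illustration.
root = ET.fromstring("<config>\n    <zip>12345</zip>\n</config>")

# The indentation between <config> and <zip> is stored as ordinary text.
print(repr(root.text))

strip_whitespace(root)
compact = ET.tostring(root, encoding="unicode")
print(compact)
```

Unlike strip_text, this leaves whitespace around real text content untouched and only removes text that consists entirely of whitespace.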
