Why are non-ASCII characters escaped in attribute-values after writing an XML-file with lxml?

Question

I'm trying to incrementally build an XML file in Python using etree.xmlfile from lxml.

My input is an XML file that has umlauts in its attribute values. I read it in with lxml, change the names of some attributes, and then write the result to a new file.

This is my code, broken down:

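from lxml import etree

# transform_element() is my own helper that renames attributes (not shown here)
with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
    with xf.element("corpus"):
        # stream over the <comment> elements of the input file
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)
            xf.write(new_element)
            del element
            del new_element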
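
For context, transform_element is not shown in full above. A minimal sketch of what such a helper might look like is below. The mapping RENAMES and the attribute names in it are hypothetical, and the sketch assumes the comment elements have no child elements. The fallback import is only there so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # what the question uses
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Hypothetical mapping of old attribute names to new ones.
RENAMES = {"txt": "text", "usr": "author"}


def transform_element(element, renames=RENAMES):
    """Return a new element with the same tag, text and tail as `element`,
    but with attribute names replaced according to `renames`.

    Assumes `element` has no child elements.
    """
    new_attrib = {
        renames.get(name, name): value
        for name, value in element.attrib.items()
    }
    new_element = element.makeelement(element.tag, new_attrib)
    new_element.text = element.text
    new_element.tail = element.tail
    return new_element


original = etree.fromstring('<comment txt="\u00fc\u00e4" id="1">hi</comment>')
renamed = transform_element(original)
print(sorted(renamed.attrib.items()))
```

Attributes that are not listed in the mapping (here `id`) are carried over unchanged.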

In the original file, I might have an element with an attribute value like this:

Some text with umlauts like this üä

But after processing, the umlauts in the same comment in the new file are escaped as numeric character references, along these lines:

Some text with umlauts like this &#xFC;&#xE4;
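
To check whether a line from the output file contains numeric character references rather than the literal characters, something like the following sketch could be used. It relies only on the standard library. The helper name find_char_refs and the attribute name `text` in the sample strings are made up for illustration.

```python
import re

# Matches decimal (&#252;) and hexadecimal (&#xFC;) numeric character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def find_char_refs(text):
    """Return the characters that appear in `text` as numeric character references."""
    chars = []
    for match in CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        chars.append(chr(code_point))
    return chars


print(find_char_refs('text="this &#xFC;&#228;"'))  # ['ü', 'ä']
print(find_char_refs('text="this üä"'))            # []
```

An empty list means the characters were written literally; a non-empty list shows which characters were escaped.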

Do you have any idea what might cause this?
