backendboi

Reputation: 151

Why are non-ASCII characters escaped in attribute-values after writing an XML-file with lxml?

I'm trying to build an XML file incrementally in Python, using etree.xmlfile from lxml.

My input is an XML file with umlauts in attribute values. I read it with lxml, rename some of the attributes, and write the result to a new file.

This is my code, reduced to the essentials:

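from lxml import etree

# Stream the original file into the new one, one <comment> element at a time.
with etree.xmlfile(path_to_new_file, encoding="utf-8") as xf:
    with xf.element("corpus"):
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)  # renames some attributes
            xf.write(new_element)
            # Drop the parsed content once it has been written out.
            element.clear()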

In the original file, I might have an element like this:

<comment title="Kübel">Some text with umlauts like this üä</comment>

But after processing, the same comment in the new file looks like this:

<comment title="K&#xFC;bel">Some text with umlauts like this üä</comment>

Do you have any idea what might cause this?

Upvotes: 1

Views: 425

Answers (1)

kjhughes

Reputation: 111581

ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
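
To check that the escaped output is harmless, here is a minimal sketch that parses both versions of the comment from the question and compares the attribute values. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only to keep the snippet dependency-free), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The two serializations from the question: literal umlaut vs. character reference.
literal = '<comment title="Kübel">Some text with umlauts like this üä</comment>'
escaped = '<comment title="K&#xFC;bel">Some text with umlauts like this üä</comment>'

# A parser resolves &#xFC; to the character ü while reading the attribute.
title_literal = ET.fromstring(literal).get("title")
title_escaped = ET.fromstring(escaped).get("title")

# Both documents carry the same attribute value once parsed.
print(title_literal == title_escaped)
```

So the difference is purely in the serialized bytes; any conforming XML consumer should see the same title either way.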

Probably the library's developer was being overly cautious and called a generic string-escaping function, possibly to reuse its escaping of < and &, which always have to be escaped, and of ' or ", which have to be escaped when they match the quotation mark that delimits the attribute value.

For precise escaping requirements concisely presented, see Simplified XML Escaping.
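
As a rough illustration of those requirements, the sketch below uses the standard library's xml.sax.saxutils helpers, which escape only what XML requires and leave non-ASCII characters alone. This is not what lxml does internally; it is just a convenient way to see the minimal rules in action, and the sample strings are made up.

```python
from xml.sax.saxutils import escape, quoteattr

# Non-ASCII characters need no escaping in an attribute value.
plain = quoteattr("Kübel")

# < and & must always be escaped.
with_markup = quoteattr("a < b & c")

# A double quote inside the value: quoteattr switches to single-quote
# delimiters, so the double quote itself can stay unescaped.
with_quote = quoteattr('say "hi"')

# The same rules apply to text nodes: only markup characters are escaped.
text = escape("1 < 2 & üä")

for sample in (plain, with_markup, with_quote, text):
    print(sample)
```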

Upvotes: 2
