I'm trying to incrementally build an XML file in Python with etree.xmlfile from lxml.
My input is an XML file with umlauts in attribute values. I read it in with lxml, make some changes to the names of the attributes, and then write it to a new file.
This is my code, broken down:
    from lxml import etree

    # Write the output incrementally instead of building the whole tree first.
    with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
        with xf.element("corpus"):
            for _, element in etree.iterparse(path_to_original_file, tag="comment"):
                new_element = transform_element(element)
                xf.write(new_element)
                del element
                del new_element
In the original file, I might have an element like this:
<comment title="Kübel">Some text with umlauts like this üä</comment>
But after processing, the same comment in the new file looks like this:
<comment title="Kübel">Some text with umlauts like this üä</comment>
Do you have any idea what might cause this?
Upvotes: 1
Views: 425
ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
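As a rough illustration of why the escaped form is harmless even though it is unnecessary, the sketch below parses a literal umlaut and a numeric character reference with the standard library's xml.etree.ElementTree and compares the results. The reference form &#252; is only an assumed example of how an escaped ü can look; it is not taken from the actual output file.

```python
import xml.etree.ElementTree as ET

# Literal umlauts in the attribute value and in the text.
literal = ET.fromstring('<comment title="Kübel">text üä</comment>')

# The same characters written as numeric character references
# (assumed form for illustration: ü is U+00FC = 252, ä is U+00E4 = 228).
escaped = ET.fromstring('<comment title="K&#252;bel">text &#252;&#228;</comment>')

# A conforming parser resolves the references, so both documents
# carry the same attribute value and the same text.
same_attribute = literal.get("title") == escaped.get("title")
same_text = literal.text == escaped.text
print(same_attribute, same_text)
```

Any XML-aware consumer of the new file should therefore see the same attribute value either way; the difference only matters to tools that read the file as plain text.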
Probably the developer of the library was being overly cautious and called a generic string-escaping function, possibly to leverage its escaping of <, which always has to be escaped, and of ' or ", which have to be escaped when they match the quotation mark delimiting the attribute value.
For precise escaping requirements concisely presented, see Simplified XML Escaping.
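The escaping rules described above can be seen in a small sketch using escape and quoteattr from the standard library's xml.sax.saxutils. This is not the function lxml calls internally, just a convenient way to see which characters a typical escaping helper touches; the sample strings are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Markup characters are escaped; the umlaut is left alone.
text = escape("Kübel < Eimer & Co")

# quoteattr adds the delimiting quotes itself and picks a delimiter
# that avoids escaping where it can.
plain = quoteattr("Kübel")                 # no quotes inside the value
with_double = quoteattr('say "hi"')        # contains only double quotes
with_both = quoteattr("it's a \"Kübel\"")  # contains both kinds of quote

print(text)
print(plain)
print(with_double)
print(with_both)
```

Note that escape also rewrites > as &gt;, which is more than XML strictly requires in most positions, so it is itself a mildly cautious helper.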
Upvotes: 2