Reputation: 1571
I have an XML file with Russian text:
<p>все чашки имеют стандартный посадочный диаметр - 22,2 мм</p>
I use xml.etree.ElementTree to manipulate it in various ways (without ever touching the text content). Then I use ElementTree.tostring:
info["table"] = ET.tostring(table, encoding="utf8") #table is an Element
Then I do some other processing with this string, and finally write it to a file:
f = open(newname, "w")
output = page_template.format(**info)
f.write(output)
f.close()
I wind up with this in my file:
<p>\xd0\xb2\xd1\x81\xd0\xb5 \xd1\x87\xd0\xb0\xd1\x88\xd0\xba\xd0\xb8 \xd0\xb8\xd0\xbc\xd0\xb5\xd1\x8e\xd1\x82 \xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xb4\xd0\xb0\xd1\x80\xd1\x82\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xbf\xd0\xbe\xd1\x81\xd0\xb0\xd0\xb4\xd0\xbe\xd1\x87\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb8\xd0\xb0\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80 - 22,2 \xd0\xbc\xd0\xbc</p>
How do I get it encoded properly?
Upvotes: 0
Views: 673
Reputation: 69082
You use
info["table"] = ET.tostring(table, encoding="utf8")
which returns bytes. Later you pass that value to a format string, which is a str (unicode). When you do that, str.format inserts the printable representation of the bytes object, escape sequences included, rather than the decoded text.
ElementTree can return a str (unicode) object instead if you use:
info["table"] = ET.tostring(table, encoding="unicode")
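To make the difference concrete, here is a minimal sketch, assuming Python 3. The shortened sample text and the variable names (p, as_bytes, as_text, broken, fixed) are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the element from the question.
p = ET.fromstring("<p>все чашки - 22,2 мм</p>")

as_bytes = ET.tostring(p, encoding="utf8")    # bytes
as_text = ET.tostring(p, encoding="unicode")  # str

# Formatting bytes into a str template inserts the bytes repr,
# which is where the \xd0\xb2... escape sequences come from.
broken = "{table}".format(table=as_bytes)

# Formatting a str keeps the Cyrillic text intact.
fixed = "{table}".format(table=as_text)

print(broken)
print(fixed)
```

The first print shows escaped bytes, while the second shows the readable markup.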
Upvotes: 1
Reputation: 3639
Try this, with output being the Russian text as a plain unicode string rather than UTF-8-encoded bytes:
import codecs
#output=u'все чашки имеют стандартный посадочный диаметр'
with codecs.open(newname, "w", "utf-16") as stream:  # or "utf-8"
    stream.write(output + u"\n")
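On Python 3 the built-in open accepts an encoding argument, so a similar result can be had without codecs. This is only a sketch: the temporary path stands in for newname, and the sample string stands in for the real output.

```python
import os
import tempfile

# Hypothetical stand-ins for the question's variables.
output = "все чашки имеют стандартный посадочный диаметр - 22,2 мм"
newname = os.path.join(tempfile.mkdtemp(), "page.html")

# Name the encoding explicitly instead of relying on the platform default.
with open(newname, "w", encoding="utf-8") as stream:
    stream.write(output + "\n")

# Read the file back with the same encoding to check the round trip.
with open(newname, encoding="utf-8") as stream:
    round_trip = stream.read()

print(round_trip == output + "\n")
```

Passing the encoding explicitly matters mostly on systems whose default encoding is not UTF-8.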
Upvotes: 0
Reputation: 1571
The problem is that ElementTree.tostring with encoding="utf8" returns a bytes object rather than a str. Decoding the result fixes it:
info["table"] = ET.tostring(table, encoding="utf8").decode("utf8")
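A short check of the decode approach, again assuming Python 3 and using a made-up sample element. One thing to watch for, as far as I can tell, is that the decoded string may start with an XML declaration that encoding="unicode" does not emit, which matters if the markup is embedded in a larger page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element; the names here are illustrative only.
table = ET.fromstring("<p>диаметр - 22,2 мм</p>")

decoded = ET.tostring(table, encoding="utf8").decode("utf8")
as_text = ET.tostring(table, encoding="unicode")

# Both are str; decoded may carry a leading <?xml ...?> declaration.
print(repr(decoded))
print(repr(as_text))
```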
Upvotes: 0