Reputation: 3619
I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file. My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them. When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like
&#252;
When I create the output in Python, I use
with open('file', 'w') as fout:
    fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
Upvotes: 0
Views: 8411
Reputation: 1123400
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout:
    fout.write(xyz)
Otherwise the open call opens the file in text mode, expects Unicode strings instead, and will encode them for you.
To decode is to interpret bytes in a given encoding (such as UTF-8); the output is Unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open('file', 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))
If you don't specify an encoding, Python uses a platform-dependent default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, decoding them only to have them re-encoded on write is redundant.
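A minimal sketch of the binary-mode approach, assuming Python 3 and the standard library's xml.etree.ElementTree (the question's etree may be lxml instead). The helper write_utf8_bytes, the element name, and the file name out.xml are made up for illustration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_utf8_bytes(root, path):
    """Serialize root to UTF-8 bytes and write them unchanged (hypothetical helper)."""
    data = ET.tostring(root, encoding="utf-8")  # bytes, already UTF-8 encoded
    with open(path, "wb") as fout:              # binary mode: no second encoding step
        fout.write(data)
    return data


root = ET.Element("root")
root.text = "ÄÖÜß"
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")
written = write_utf8_bytes(root, out_path)
```

Because the bytes are written as-is, the platform's default text encoding never comes into play.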
Note that Python file operations never transform existing Unicode code points into XML character references (such as &#252;); other code you have could do this, but you didn't share that with us.
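One place such references can come from is the serializer rather than the file object. A minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree (the question's etree may be lxml, which is not tested here); the element name is made up:

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "ü"

# With no encoding argument, tostring() serializes as US-ASCII, so
# non-ASCII characters are emitted as numeric character references.
ascii_bytes = ET.tostring(root)

# With an explicit UTF-8 encoding, the character is kept as UTF-8 bytes.
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

If that is what happens in the question's code, decoding the ASCII output as UTF-8 afterwards cannot bring the original characters back; the encoding has to be passed to the serializer.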
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.
Upvotes: 2
Reputation: 20792
Some explanation of xml.etree.ElementTree in Python 2 and its function parse(). The function takes the source as its first argument, which can be either an open file object or a filename. It creates an ElementTree instance and then passes the argument to tree.parse(...), which looks like this:
def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root
You can see from the third line that if a filename is passed, the file is opened in binary mode. So if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. In that case, you should also open the output file in binary mode.
Another possibility is to use codecs.open(filename, encoding='utf-8') for opening the input file, and to pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case, you can also use codecs.open(...) with UTF-8 for writing. You can pass the opened output file object to tree.write(f), or you can let tree.write(filename, encoding='utf-8') open the file for you.
Upvotes: 1
Reputation: 414585
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
tree.write('file', encoding='utf-8')
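A small round-trip sketch of this call, assuming Python 3 and the standard library's xml.etree.ElementTree. The sample document and the temporary file name are made up, and xml_declaration=True is an optional extra not used in the line above:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a tiny tree containing the German characters from the question.
tree = ET.ElementTree(ET.fromstring("<root>ÄÖÜß</root>"))

out_path = os.path.join(tempfile.mkdtemp(), "out.xml")
# write() opens the file itself and encodes the output as UTF-8.
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the raw bytes back, then reparse to confirm the text survived.
with open(out_path, "rb") as f:
    raw = f.read()
reparsed = ET.parse(out_path).getroot().text
```

Letting write() handle the file avoids the separate tostring(), decode() and open() steps where the encodings can get out of sync.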
Upvotes: 2
Reputation: 2254
I think this should work:
import codecs
with codecs.open("file.xml", 'w', "utf-8") as fout:
    # do stuff with the file object, e.g. write Unicode text
    fout.write(u"<root>\u00e4</root>")
Upvotes: 2