Kaly

Reputation: 3619

Python: How to preserve Ä,Ö,Ü when writing to file

I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file. My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them. When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like

&#252;

When I create the output in Python, I use

with open('file', 'w') as fout:
  fout.write(etree.tostring(tree.getroot()).decode('utf-8'))

What do I have to do instead?

Upvotes: 0

Views: 8411

Answers (4)

Martijn Pieters

Reputation: 1123400

When writing raw bytestrings, you want to open the file in binary mode:

with open('file', 'wb') as fout:
    fout.write(xyz)

Otherwise, open() opens the file in text mode, expects Unicode strings instead, and encodes them for you.

To decode is to interpret bytes in a given encoding (such as UTF-8); the result is Unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:

with open('file', 'w', encoding='utf-8') as fout:
    fout.write(xyz.decode('utf-8'))

If you don't specify an encoding, Python uses a platform-dependent default, which is usually a Bad Thing. Note that since you already have UTF-8 encoded byte strings to start with, decoding them only to encode them again is pointless.

Note that Python file operations never transform existing Unicode code points into XML character references (such as &#252;). Other code of yours could be doing this, but you didn't share it with us.
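One place such character references can come from is the serializer itself. The following is a minimal sketch, assuming the etree in the question behaves like the standard library's xml.etree.ElementTree under Python 3; the sample element and the output path are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample element; \u00fc is the letter "ü".
root = ET.fromstring("<stadt>D\u00fcsseldorf</stadt>")

# tostring() serializes to ASCII by default, so "ü" becomes a
# numeric character reference.
ascii_bytes = ET.tostring(root)  # b'<stadt>D&#252;sseldorf</stadt>'

# Asking for UTF-8 keeps the character itself (encoded as bytes C3 BC).
utf8_bytes = ET.tostring(root, encoding="utf-8")

# The result is already a byte string: write it in binary mode, no decode.
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")  # hypothetical path
with open(out_path, "wb") as fout:
    fout.write(utf8_bytes)
```

If this matches your setup, passing encoding='utf-8' to tostring() and writing in binary mode should be enough.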

I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.

Upvotes: 2

pepr

Reputation: 20792

Some explanation of xml.etree.ElementTree in Python 2 and its parse() function. The function takes the source as its first argument, which can be either an open file object or a filename. It creates an ElementTree instance and then passes the argument to tree.parse(...), which looks like this:

def parse(self, source, parser=None):
    if not hasattr(source, "read"):
        source = open(source, "rb")
    if not parser:
        parser = XMLParser(target=TreeBuilder())
    while 1:
        data = source.read(65536)
        if not data:
            break
        parser.feed(data)
    self._root = parser.close()
    return self._root

You can see from the third line that if a filename was passed, the file is opened in binary mode. So if the file content is UTF-8, you are processing elements with UTF-8 encoded binary content. In that case, you should also open the output file in binary mode.

Another possibility is to open the input file with codecs.open(filename, encoding='utf-8') and pass the open file object to xml.etree.ElementTree.parse(...). This way, the ElementTree instance works with Unicode strings, and you should encode the result to UTF-8 when writing the content back. In that case, you can also use codecs.open(...) with UTF-8 for writing. You can pass the opened output file object to tree.write(f), or let tree.write(filename, encoding='utf-8') open the file for you.
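A rough Python 3 sketch of the file-object variant, for comparison with the Python 2 description above. The io.BytesIO objects are stand-ins for files opened in binary mode, and the sample content is made up.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for open(filename, "rb") on a UTF-8 encoded input file.
# \u00fc\u00df spells "üß", so the text is "Grüße".
source = io.BytesIO("<gruss>Gr\u00fc\u00dfe</gruss>".encode("utf-8"))

# The parser receives bytes and decodes them itself.
tree = ET.parse(source)
text = tree.getroot().text  # a Unicode str in Python 3

# Stand-in for open(filename, "wb"): tree.write() encodes to UTF-8.
sink = io.BytesIO()
tree.write(sink, encoding="utf-8")
result = sink.getvalue()
```

Here the element text is Unicode in memory, and the only encoding step is the one tree.write() performs.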

Upvotes: 1

jfs

Reputation: 414585

To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:

tree.write('file', encoding='utf-8')
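As a fuller sketch of the same call in a read, modify, write round trip (Python 3, standard library xml.etree.ElementTree assumed). The file names, the sample content and the attribute change are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.xml")    # hypothetical file names
out_path = os.path.join(workdir, "out.xml")

# Create a UTF-8 input file without BOM, like the ones in the question.
with open(in_path, "w", encoding="utf-8") as f:
    f.write("<name>M\u00fcller</name>")      # "Müller"

tree = ET.parse(in_path)
tree.getroot().set("checked", "yes")         # stand-in for the real changes

# xml_declaration=True is optional; it records the encoding in the file.
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the result back as UTF-8 to check what was written.
with open(out_path, encoding="utf-8") as f:
    written = f.read()
```

With the declaration in place, editors that read the XML prolog have a better chance of detecting the encoding.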

Upvotes: 2

Blubber

Reputation: 2254

I think this should work:

import codecs

with codecs.open("file.xml", 'w', "utf-8") as fout:
    # fout encodes Unicode text as UTF-8 on write; the string is only a sample
    fout.write(u"<name>M\u00fcller</name>")

Upvotes: 2
