David542

Reputation: 110382

Unicode decode error using codecs.open()

I have run into a character encoding problem as follows:

rating = 'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(

"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating))

The error I get is:

  File "./assetshare.py", line 314, in write_file
    </ratings>""" % (values['rating_system'], rating))

I know that the encoding error is related to Barntillåten, because if I replace that word with test, the function works fine.

Why is this encoding error happening and what do I need to do to fix it?

Upvotes: 0

Views: 1847

Answers (3)

ecatmur

Reputation: 157414

In Python 2, codecs.open expects to read and write unicode objects. You're passing it a str.

The fix is to ensure that the data you pass it is unicode:

new_file.write((

"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating)
).decode('utf-8'))

If you use unicode literals (u"..."), Python 2 promotes the result of combining them with ASCII-only str values to unicode. Here it would be sufficient to have rating = u'Barntillåten':

rating = u'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(

"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating))

You can write a str object to a codecs.open file, but only if the str decodes cleanly under the default encoding; in practice that means it is only safe when the str is plain ASCII. The default encoding is ASCII and should be left that way; see Changing default encoding of Python?
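
To make the failure concrete, here is a minimal sketch of the step described above. In Python 2 the decode happens implicitly when a str reaches the file; the explicit .decode('ascii') call below only stands in for that implicit step, and the names raw_bytes, ascii_ok and text are illustrative, not from the question.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes a Python 2 byte string would hold for 'Barntillåten'.
raw_bytes = u'Barntill\xe5ten'.encode('utf-8')

# Stand-in for the implicit decode: ASCII cannot represent the bytes of 'å'.
try:
    raw_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the codec the bytes were actually written in succeeds.
text = raw_bytes.decode('utf-8')
```

This is why replacing the word with test makes the error disappear: pure ASCII bytes survive the ASCII decode.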

Upvotes: 2

Ignacio Vazquez-Abrams

Reputation: 799120

You need to use unicode literals.

u'...'
u"..."
u'''......'''
u"""......"""
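
A self-contained sketch of that advice applied to the question's code, assuming the template itself is also made a unicode literal. The temporary folder and the 'example' rating system are placeholders I made up so the snippet runs on its own; they are not from the question.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

folder = tempfile.mkdtemp()                # placeholder for the question's folder
values = {'rating_system': u'example'}     # hypothetical value
rating = u'Barntill\xe5ten'                # u'Barntillåten'

# A unicode template, so the formatted result is unicode as well.
template = u"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>"""

path = os.path.join(folder, "metadata.xml")
new_file = codecs.open(path, 'w', 'utf-8')
try:
    new_file.write(template % (values['rating_system'], rating))
finally:
    new_file.close()

# Read the file back as raw bytes to confirm it holds UTF-8.
with open(path, 'rb') as f:
    written = f.read()
```

The u prefix is accepted by Python 2 and by Python 3.3 and later, so the snippet should behave the same on both.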

Upvotes: 1

ephemient

Reputation: 204926

rating must be a Unicode string in order to contain Unicode codepoints.

rating = u'Barntillåten'

Otherwise, in Python 2, the non-Unicode string 'Barntillåten' contains bytes (encoded with whatever your source encoding was), not codepoints.
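
A quick way to see the difference between code points and bytes, assuming the source file was saved as UTF-8 (the variable names are illustrative only):

```python
# -*- coding: utf-8 -*-
text = u'Barntill\xe5ten'       # u'Barntillåten' as a Unicode string
data = text.encode('utf-8')     # the same word as UTF-8 bytes

n_codepoints = len(text)        # one item per character
n_bytes = len(data)             # 'å' takes two bytes in UTF-8
```

The Unicode string has 12 code points, while the encoded form has 13 bytes, which is what the non-Unicode literal holds in Python 2.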

Upvotes: 3
