Understanding unicode in Python and UnicodeDecodeError

Question

In Python 2.7 on a Mac I'm printing file names retrieved with nltk's PlaintextCorpusReader:

infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print fileid

and get UnicodeDecodeError: 'ascii', '100316-N1-The \xc2\xa3250bn cost of developing.txt', 14, 15, 'ordinal not in range(128)' because of the £ symbol in a filename.

As I understand things, fileid is a unicode string which I need to encode to the default encoding before I can print it, and the default encoding is ASCII.

If I use print fileid.encode('ascii', 'ignore'), I get the same error.

If I change the default encoding by setting encoding = "utf-8" in site.py (per this advice), it works.

Can anyone tell me: (a) why encode has failed, (b) why changing the default encoding works, and (c) what I should do if I'm doing something wrong here? (For example, this describes setting the default encoding as 'an ugly hack' that leads to the misuse of strings and creation of buggy code.)

(Disclaimer: new to Python, very grateful for your patience if this is obvious)

=========================================== Update to respond to Rob:

Rob, here is the full text of the test code:

import sys
import os
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/richlyon/Documents/Filing/Infobase/'
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')

for fileid in infobasecorpus.fileids():
    print type(fileid)             # result: <type 'str'>
    fileid = fileid.decode('utf8')
    print type(fileid)             # result: <type 'unicode'>
    print fileid.encode('ascii')

I've set default encoding back to ascii and run it.

print fileid.encode('ascii') still fails on £ in a filename.

=========================================== Last update in case this is of help to anyone else.

I needed to write:

fileid = fileid.decode('utf8')
print fileid.encode('ascii', 'ignore')

but text = nltk.Text(infobasecorpus.words(fileid)) chokes if it is fed strings, which seems to contradict the recommendation to immediately convert everything into unicode before further processing.

But now it works. Thanks all, and Rob in particular.

Rob Cowie · Accepted Answer

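Check the type of the fileid object. I suspect it is a byte string (str), not a unicode object as you suggest. The UnicodeDecodeError is raised by an implicit decode that Python performs before it encodes the string for output (by print). Calling encode on a byte string makes Python 2 first decode it with the default encoding (ASCII), which fails on the bytes of the £ sign; the 'ignore' argument applies only to the encode step, so it cannot prevent that failure. Setting the default encoding to UTF-8 lets the implicit decode succeed, which is why the site.py change appears to fix the problem.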
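The implicit decode can be reproduced by making it explicit. This is a minimal sketch, not NLTK's code: it hard-codes the filename bytes from the error message in the question and assumes they are UTF-8 (the names raw and err are made up for illustration). It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# Filename bytes copied from the error message in the question;
# \xc2\xa3 is the UTF-8 encoding of the pound sign.
raw = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'

err = None
try:
    # Python 2 performs this step behind the scenes when encode()
    # is called on a byte string (str) instead of a unicode object.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    err = exc

# Show the codec, the failing byte range, and the reason.
print(type(err).__name__, err.encoding, err.start, err.end, err.reason)
```

The reported range of 14 to 15 matches the error in the question: byte 14 is \xc2, the first byte of the encoded £.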

Once the string is successfully decoded (to unicode), you can then print it by explicitly encoding it with a codec supported by your terminal. If your terminal supports the display of unicode, you may not need to encode it before output.

infobasecorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
for fileid in infobasecorpus.fileids():
    fileid = fileid.decode('utf8')  # fileid is now a unicode object
    print fileid.encode('utf8')     # encode explicitly for the terminal

Replace utf8 with whatever encoding your filesystem uses (possibly latin1 on Windows; I'm not sure).
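A short sketch of the decode-then-encode round trip, again using the filename bytes from the question and assuming they are UTF-8. It compares a UTF-8 encode with three ways of encoding to ASCII, including the strict default that fails in the question's update. The variable names are illustrative only, and the code runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function

raw = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'

# Step 1: decode the bytes with the encoding they were written in.
text = raw.decode('utf-8')

# Step 2: encode explicitly for the output device.
as_utf8 = text.encode('utf-8')                   # lossless round trip
ascii_ignore = text.encode('ascii', 'ignore')    # drops the pound sign
ascii_replace = text.encode('ascii', 'replace')  # substitutes '?'

# Without an error handler the encode is strict and raises.
strict_failed = False
try:
    text.encode('ascii')
except UnicodeEncodeError:
    strict_failed = True

print(as_utf8 == raw)
print(ascii_ignore)
print(ascii_replace)
print(strict_failed)
```

Note that the failure in the strict case is a UnicodeEncodeError, not the UnicodeDecodeError from the original question: once the string is a unicode object, no implicit decode takes place.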

EDIT: Overriding the site-wide default encoding is considered a hack because (a) it can hide programming errors, which may make your code non-portable across Python installs, and (b) it can affect other code running from the same Python installation. Further, being explicit about encoding and decoding your strings makes life easier when you return to your code later: you don't have to remember that you modified site.py.
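One way to stay explicit without touching site.py is to keep the decoding and encoding in two small helpers. This is only a sketch: to_text and for_terminal are hypothetical names, not part of NLTK or the standard library, and falling back to UTF-8 when the stream reports no encoding is an assumption that may not suit every setup.

```python
from __future__ import print_function
import sys

TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def to_text(name, encoding='utf-8'):
    """Return name as text, decoding byte strings with an explicit codec."""
    if isinstance(name, TEXT_TYPE):
        return name
    return name.decode(encoding)


def for_terminal(text, stream=None):
    """Encode text for a stream, replacing characters it cannot represent."""
    if stream is None:
        stream = sys.stdout
    # stream.encoding can be None or missing, for example when output is piped.
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')


# Hypothetical filename used only to exercise the helpers.
name = to_text(b'The \xc2\xa3250bn cost.txt')
print(repr(for_terminal(name)))
```

Because both conversions name their codec, the code behaves the same on any Python installation regardless of its default encoding.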
