In Python 2.7 on a Mac I'm printing file names retrieved with nltk's PlaintextCorpusReader:
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print fileid
and get UnicodeDecodeError: 'ascii', '100316-N1-The \xc2\xa3250bn cost of developing.txt', 14, 15, 'ordinal not in range(128)'
because of the £ symbol in a filename.
As I understand things, fileid is a unicode string which I need to encode to the default encoding before I can print it, and the default encoding is ASCII.
If I use print fileid.encode('ascii', 'ignore'), I get the same error.
If I change the default encoding by setting encoding = "utf-8" in site.py (per this advice), it works.
Can anyone tell me:
(a) why encode has failed,
(b) why setting encoding works, and
(c) what I should do if I'm doing something wrong here? (For example, this describes setting default encoding as 'an ugly hack' that leads to the misuse of strings and creation of buggy code.)
(Disclaimer: new to Python, very grateful for your patience if this is obvious)
=========================================== Update to respond to Rob:
Rob, here is the full text of the test code:
import sys
import os
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/richlyon/Documents/Filing/Infobase/'
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print type(fileid)  # result <type 'str'>
    fileid = fileid.decode('utf8')
    print type(fileid)  # result <type 'unicode'>
    print fileid.encode('ascii')
I've set default encoding back to ascii and run it. print fileid.encode('ascii') still fails on £ in a filename.
=========================================== Last update in case this is of help to anyone else.
I needed to write:
fileid = fileid.decode('utf8')
print fileid.encode('ascii', 'ignore')
but text = nltk.Text(infobasecorpus.words(fileid)) chokes if it is fed <type 'unicode'> strings, which seems to contradict the recommendation to immediately convert everything into unicode before further processing.
But now it works. Thanks all, and Rob in particular.
Upvotes: 2
Views: 1429
Reputation: 22619
Check the type of the fileid object. I suspect it is not a unicode object as you suggest. The UnicodeDecodeError is raised by an implicit decode: calling encode on a byte string (str) makes Python 2 first decode it with the default ASCII codec, before anything is encoded for output by print.
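To make the implicit step visible, here is a minimal sketch that performs the same ASCII decode explicitly on the file name from the question. The names raw_name and ascii_decode_failure are hypothetical, chosen only for this illustration; the byte string is copied from the error message in the question.

```python
# File name bytes as reported in the question's error message.
# \xc2\xa3 is the UTF-8 encoding of the pound sign.
raw_name = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'


def ascii_decode_failure(byte_name):
    """Hypothetical helper: attempt the ASCII decode that Python 2 performs
    implicitly when encode() is called on a byte string, and report where
    it fails. Returns None if the bytes are pure ASCII."""
    try:
        byte_name.decode('ascii')
    except UnicodeDecodeError as err:
        return err.start, err.end, err.reason
    return None


failure = ascii_decode_failure(raw_name)
# failure == (14, 15, 'ordinal not in range(128)'), the same position
# and reason that appear in the question's traceback.
```

The error handler passed to encode (such as 'ignore') never gets a chance to run, because the failure happens in the decode step that precedes it.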
Once the string is successfully decoded (to unicode), you can then print it by explicitly encoding it with a codec supported by your terminal. If your terminal supports the display of unicode, you may not need to encode it before output.
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    fileid = fileid.decode('utf8')  # fileid is now a unicode object
    print fileid.encode('utf8')
Replace utf8 with whatever encoding is used by your filesystem (possibly latin1 on Windows; I'm not sure).
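Rather than guessing the filesystem encoding, one option is to ask Python for it with sys.getfilesystemencoding(). The sketch below assumes the file names arrive as byte strings, as they did in the question; the helper name to_text and the UTF-8 fallback are assumptions made for this example, not part of nltk.

```python
import sys


def to_text(name, encoding=None):
    """Hypothetical helper: return `name` as a text (unicode) string.

    Byte strings are decoded with the given encoding, or with the
    filesystem encoding reported by Python when none is given.
    Falling back to UTF-8 is an assumption for this sketch.
    """
    if isinstance(name, bytes):  # `bytes` is an alias of `str` on Python 2
        codec = encoding or sys.getfilesystemencoding() or 'utf-8'
        return name.decode(codec)
    return name  # already text, nothing to do


# Decoding the question's file name with an explicit codec:
title = to_text(b'100316-N1-The \xc2\xa3250bn cost of developing.txt', 'utf-8')
```

Passing an explicit encoding, as in the last line, keeps the result independent of the machine the code runs on.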
EDIT: Overriding the site-wide default encoding is considered a hack because a) it can hide programming issues, which may mean your code is not portable across Python installs, and b) it can affect other code running from the same Python installation. Further, being explicit about encoding and decoding your strings makes life easier when you return to your code later: you don't have to remember that you modified site.py.
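As a rough illustration of being explicit at the output boundary instead of editing site.py, the sketch below encodes an already-decoded name with a chosen codec and error handler. The function name encode_for_output and its defaults are hypothetical, picked for this example.

```python
def encode_for_output(text, encoding='ascii', errors='replace'):
    """Hypothetical helper: encode a unicode string for a destination
    whose encoding is known, stating explicitly what happens to
    characters that the destination cannot represent."""
    return text.encode(encoding, errors)


# Decode once on input (the bytes are UTF-8 in the question's case).
name = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'.decode('utf-8')

replaced = encode_for_output(name)                    # pound sign becomes '?'
dropped = encode_for_output(name, errors='ignore')    # pound sign is removed
lossless = encode_for_output(name, encoding='utf-8')  # round-trips the bytes
```

With 'replace' or 'ignore' the pound sign is lost on an ASCII-only destination, which is the trade-off the question's final update accepts; encoding to UTF-8 keeps it when the destination supports that.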
Upvotes: 2