Reputation: 949
I'm running this:
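from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))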
but it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte, at the open() call.
I checked news_train.filenames. It is:
array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'],
dtype='<U74')
The paths look correct. It may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I haven't been able to fix it after many attempts. Thank you!
P.S. It's an ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py
Upvotes: 0
Views: 947
Reputation: 6438
Actually, in Python 3 the open function opens files in text mode ('r') by default, which decodes the file content using the platform's default encoding (UTF-8 on most platforms). Since your files are encoded in latin1, decoding them as UTF-8 can raise UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").
open(f, 'rb').read() # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read() # returns latin1 decoded `str`
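To make the difference concrete, here is a minimal, self-contained sketch. It assumes a made-up sample line containing the byte 0xe4 (the byte named in the error message) and writes it to a temporary file. The sample text and variable names are hypothetical and not taken from the 20news data.

```python
import os
import tempfile

# Hypothetical sample: 0xe4 is 'ä' in latin1, but here it is not valid UTF-8
# because the byte that follows it is not a continuation byte.
raw = b"Subject: M\xe4rz report\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Decoding as UTF-8 fails on the 0xe4 byte.
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Binary mode skips decoding entirely and returns bytes.
    with open(path, "rb") as fh:
        as_bytes = fh.read()

    # latin1 maps every byte to a code point, so decoding cannot fail.
    with open(path, encoding="latin1") as fh:
        as_text = fh.read()
finally:
    os.remove(path)
```

If you go the binary route, the encoding='latin1' already passed to TfidfVectorizer in the question should take care of decoding the bytes.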
Upvotes: 0
Reputation: 949
Well, I found the solution: use
open(f, encoding="latin1")
I'm not sure why it only happens on my Mac, though. I'd like to know why.
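One way to investigate is to compare what each machine uses as its default text encoding. The sketch below assumes the documented Python 3 behaviour that open() without an encoding argument falls back to the locale's preferred encoding. The variable name is only illustrative.

```python
import locale
import sys

# Encoding that open() is expected to use in text mode when none is given
# (an assumption based on the documented Python 3 default).
default_text_encoding = locale.getpreferredencoding(False)

print("Default text encoding:", default_text_encoding)
print("Platform:", sys.platform)
```

If this prints a UTF-8 variant on the Mac and something like cp1252 or latin1 elsewhere, that would explain why the same script only fails on one machine.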
Upvotes: 1