Denly
Denly

Reputation: 949

UnicodeDecodeError while loading file in python

I'm running this:

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                for f in news_train.filenames))

but it got UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte. at open() function.

I checked the news_train.filenames. It is:

array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
       ..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'], 
      dtype='<U74')

Paths look correct. It may be about dtype or my environment (I'm Mac OSX 10.11), but I can't fix it after I tried many times. Thank you!!!

p.s it's a ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py

Upvotes: 0

Views: 947

Answers (2)

Philip Tzou
Philip Tzou

Reputation: 6438

Actually in Python 3+, the open function opens and reads file in default mode 'r' which will decode the file content (on most platform, in UTF-8). Since your files are encoded in latin1, decode them using UTF-8 could cause UnicodeDecodeError. The solution is either opening the files in binary mode ('rb'), or specify the correct encoding (encoding="latin1").

open(f, 'rb').read()  # returns `byte` rather than `str`
# or,
open(f, encoding='latin1').read()  # returns latin1 decoded `str`

Upvotes: 0

Denly
Denly

Reputation: 949

Well I found the solution. Using

open(f, encoding = "latin1")

I'm not sure why it only happens on my mac though. Wish to know it.

Upvotes: 1

Related Questions