Nicole

Reputation: 59

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>

I am new to Python. I have a .txt file (size: 15,259 KB). I want to load the file and process it, but I keep getting the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>".

import nltk
from nltk import FreqDist
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#Read the datasets
path = "C:\\tmp\\FILENAME.txt"
dataset={}
dataset_raw = {}
allFeatures=set()
tot_articles = 0
articles_count={}

N={} # Number of articles in each corpus

for category in categories:
    fileName=path
    f=open(fileName,'r')
    text = ''
    text_raw = ''    
    lines=(f.readlines())
    tot_articles+=len(lines)
    articles_count[category] = len(lines)
    dataset_raw[category] = list(map(lambda line: line.lower(), lines))

    for line in lines:
        text+=line.replace('\n',' ').lower()
        text_raw = line.lower()
    f.close()
    N[category]=len(lines)

    tokens = nltk.word_tokenize(text)
    dataset[category] = nltk.Text(tokens)

Below is the error I got:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-14-222e94b75803> in <module>
     14     text = ''
     15     text_raw = ''
---> 16     lines=(f.readlines())
     17     tot_articles+=len(lines)
     18     articles_count[category] = len(lines)

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>

Upvotes: 1

Views: 3991

Answers (1)

Amit Kumar

Reputation: 897

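Try specifying the encoding when opening the file. The traceback shows Python decoding with cp1252, the default on your Windows setup, and cp1252 has no character assigned to byte 0x9d. If the file is actually UTF-8 (likely, but check how it was saved), pass that encoding explicitly.

For example: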

f=open(fileName,'r', encoding="utf8")
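
A slightly fuller sketch of the same idea, assuming the file really is UTF-8. The helper name read_lines and the temporary sample file are made up for illustration and are not part of the question's code. It writes a small file containing a right double quotation mark (whose UTF-8 bytes end in 0x9d), reads it back with an explicit encoding, and shows that decoding the same bytes as cp1252 fails in the way described in the question:

```python
import os
import tempfile


def read_lines(file_name, encoding="utf-8"):
    """Read all lines from a text file using an explicit encoding.

    Hypothetical helper for illustration; utf-8 is an assumption about the file.
    """
    # The with-block closes the file even if decoding raises an error.
    with open(file_name, "r", encoding=encoding) as f:
        return f.readlines()


# Build a small sample file. The right double quotation mark (U+201D)
# is encoded in UTF-8 as the bytes E2 80 9D, so the file contains 0x9d.
sample = "She said \u201chello\u201d\nsecond line\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

try:
    # Decoding with the encoding the file was written in succeeds.
    lines = read_lines(sample_path)

    # Decoding the same bytes as cp1252 raises UnicodeDecodeError,
    # because cp1252 leaves byte 0x9d undefined.
    try:
        read_lines(sample_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(sample_path)

print(lines)
print(cp1252_failed)
```

If the file turns out not to be valid UTF-8 either, open() also accepts errors="replace", which substitutes a placeholder character for undecodable bytes instead of raising. That hides the problem rather than fixing it, so finding the file's real encoding is usually the better option.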

Upvotes: 5
