Reputation: 59
I am new to Python. I have a .txt file (size: 15,259 KB). I want to load the file and process it, but I keep getting the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>"
import nltk
from nltk import FreqDist
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Read the datasets
path = "C:\\tmp\\FILENAME.txt"
dataset = {}
dataset_raw = {}
allFeatures = set()
tot_articles = 0
articles_count = {}
N = {}  # Number of articles in each corpus
for category in categories:
    fileName = path
    f = open(fileName, 'r')
    text = ''
    text_raw = ''
    lines = f.readlines()
    tot_articles += len(lines)
    articles_count[category] = len(lines)
    dataset_raw[category] = list(map(lambda line: line.lower(), lines))
    for line in lines:
        text += line.replace('\n', ' ').lower()
        text_raw = line.lower()
    f.close()  # was "f.close", which never actually called the method
    N[category] = len(lines)
    tokens = nltk.word_tokenize(text)
    dataset[category] = nltk.Text(tokens)
Below is the error I got:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-14-222e94b75803> in <module>
14 text = ''
15 text_raw = ''
---> 16 lines=(f.readlines())
17 tot_articles+=len(lines)
18 articles_count[category] = len(lines)
~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>
Upvotes: 1
Views: 3991
Reputation: 897
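Try specifying the encoding when opening the file. Without it, open() falls back to the platform default, which your traceback shows is cp1252, and byte 0x9d has no mapping in that codec.
For example, if the file is UTF-8: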
f=open(fileName,'r', encoding="utf8")
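A slightly fuller sketch of the same idea, in case it helps. It assumes the file really is UTF-8, which the question does not confirm. The helper name read_lines and the temporary demo file are made up for illustration. The demo writes a right double quotation mark (U+201D), whose UTF-8 bytes end in 0x9d, and shows that reading it as cp1252 fails the same way while reading it as UTF-8 works.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8", errors="strict"):
    """Return the lines of a text file decoded with an explicit encoding.

    Hypothetical helper, not part of the original code.
    """
    # The with-block closes the file even if decoding raises.
    with open(path, "r", encoding=encoding, errors=errors) as f:
        return f.readlines()


# Demo file: U+201D encodes to the bytes E2 80 9D in UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("He said \u201dhello\u201d\n".encode("utf-8"))
    demo_path = tmp.name

# Reading with the matching encoding succeeds.
lines = read_lines(demo_path)

# Reading the same bytes as cp1252 hits the undefined byte 0x9d.
try:
    read_lines(demo_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.remove(demo_path)
```

If the file turns out not to be valid UTF-8 either, passing errors="replace" lets the read finish by substituting U+FFFD for undecodable bytes, at the cost of losing those characters.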
Upvotes: 5