TheRealG

Reputation: 11

UnicodeDecodeError: 'charmap' codec can't decode byte Z in position Y: character maps to <undefined>

I'm attempting to perform sentiment analysis using a large training dataset. The problem is that when I perform the analysis using the 'sampleTweets.csv', everything turns out okay except that the analysis is not accurate because the sampleTweets dataset is too small.

When I use a larger dataset such as 'full_training_dataset.csv', I get the following error

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6961: character maps to <undefined>

I've tried adding encoding="utf-8" and other encodings such as latin-1, but when I do that, the program keeps running without producing any output in the console.

The code is below; the project is on GitHub at https://github.com/ravikiranj/twitter-sentiment-analyzer, and I'm using the simpleDemo.py file.

import csv

# Read the tweets one by one and process them; getStopWordList, processTweet
# and getFeatureVector are helper functions from the project's simpleDemo.py
inpTweets = csv.reader(open('data/full_training_dataset.csv', 'r'), delimiter=',', quotechar='|')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
count = 0
featureList = []
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet, stopWords)
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment))
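
For reference, this is roughly what the encoding attempt looked like (a sketch rather than my exact code; the with block and newline='' are just the csv-module idiom, added here for completeness):

import csv

# Sketch of the attempt described above: pass an explicit encoding to open().
# With encoding='utf-8' (or 'latin-1') the charmap error no longer appears,
# but the script then runs without printing anything to the console.
with open('data/full_training_dataset.csv', 'r', encoding='utf-8', newline='') as f:
    inpTweets = csv.reader(f, delimiter=',', quotechar='|')
    for row in inpTweets:
        sentiment, tweet = row[0], row[1]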

Upvotes: 1

Views: 704

Answers (1)

I know this is an old post, but this worked for me.

  1. Go to your Python installation:

    example: C:\Python\Python37-32\Lib\site-packages\stopwordsiso

  2. Open __init__.py

  3. Change

    with open(STOPWORDS_FILE) as json_data:

    to

    with open(STOPWORDS_FILE, encoding="utf8") as json_data:
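
For context, the "charmap" codec shows up because on Windows open() without an encoding argument falls back to the locale codec (usually cp1252), whose decoder is codecs.charmap_decode and which has no mapping for bytes such as 0x9d. A minimal illustration (the file name here is hypothetical):

    # Without encoding=..., open() uses locale.getpreferredencoding(), which
    # is typically cp1252 on Windows; bytes like 0x9d have no mapping there
    # and raise UnicodeDecodeError. Passing the encoding explicitly avoids it.
    with open("tweets_utf8.csv", encoding="utf8") as f:  # hypothetical file
        text = f.read()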

Upvotes: 1
