Reputation: 11
I'm attempting to perform sentiment analysis using a large training dataset. When I run the analysis with 'sampleTweets.csv', everything works, but the results are not accurate because the sampleTweets dataset is too small.
When I use a larger dataset such as 'full_training_dataset.csv', I get the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6961: character maps to <undefined>
I've tried adding encoding="utf-8" and other encodings such as latin-1, but when I do that, the program keeps running without producing any output in the console (see the sketch after the code below for where I added it).
Here is the code; the GitHub link to the project is https://github.com/ravikiranj/twitter-sentiment-analyzer, and I'm using the simpleDemo.py file.
import csv

# Read the tweets one by one and process them
inpTweets = csv.reader(open('data/full_training_dataset.csv', 'r'), delimiter=',', quotechar='|')
stopWords = getStopWordList('data/feature_list/stopwords.txt')
count = 0
featureList = []
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet, stopWords)
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment))
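For reference, this is roughly where I added the encoding (a minimal sketch; the errors="replace" argument is only an illustration of where such an option would go, it was not part of my original attempt):

import csv

# Pass the encoding to open(), not to csv.reader()
with open('data/full_training_dataset.csv', 'r', encoding='utf-8', errors='replace') as f:
    inpTweets = csv.reader(f, delimiter=',', quotechar='|')
    for row in inpTweets:
        sentiment = row[0]
        tweet = row[1]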
Upvotes: 1
Views: 704
Reputation: 11
I know this is an old post, but this worked for me.
Go to your Python installation, for example:
C:\Python\Python37-32\Lib\site-packages\stopwordsiso
Open __init__.py
Change the line
with open(STOPWORDS_FILE) as json_data:
to
with open(STOPWORDS_FILE, encoding="utf8") as json_data:
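For context, after the edit the relevant block in __init__.py should read roughly like this (STOPWORDS_FILE is defined earlier in that module; the json.load line and the STOPWORDS name are my assumptions about the surrounding code, only the open() call is the actual change):

import json

with open(STOPWORDS_FILE, encoding="utf8") as json_data:  # the edited line
    STOPWORDS = json.load(json_data)  # assumed use of the decoded JSON data

The same idea applies to the csv.reader call in the question: pass the encoding to the open() call that feeds the reader.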
Upvotes: 1