Reputation: 71
I am trying to read a CSV file that contains 3 million tweets. Eventually, I want to remove the stop words and get the top 2,000 unique words along with their frequencies. However, I am running into an error well before I get to that point. Here is my code:
import nltk
from nltk.corpus import stopwords
import csv
f = open("/Users/shannonmcgregor/Desktop/ShanTweets.csv")
shannon_sample_tweets = f.read()
f.close()
filtered_tweets = [w for w in shannon_sample_tweets if not w in stopwords.words('english')]
And the error I get after I run that is:
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Can anyone help me figure out what is going wrong? I did put # -*- coding: utf-8 -*- at the top of my source code.
Upvotes: 0
Views: 4512
Reputation: 27612
Good, your comment clears things up. To read your CSV as unicode, first import codecs:
import codecs
Then open the file with:
f = codecs.open("/Users/shannonmcgregor/Desktop/ShanTweets.csv","r","utf-8")
Then if you recheck the type of what f.read() returns, you should see unicode. This of course assumes your tweets are valid UTF-8, which appeared to be the case (I took a quick peek!). If you plan on working with strings in Python, I recommend reading up on encodings; they will become important for your work.
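For what it's worth, here is a rough sketch of how the decoded file could feed into the word count you are after. Note that looping over the result of f.read() gives you single characters, not words, so the text has to be split first. Everything named here is my own assumption for illustration: STOP_WORDS is a tiny hypothetical stand-in for stopwords.words('english'), top_words is a made-up helper, the sample file is invented, and splitting on whitespace ignores any real CSV columns your file may have.

```python
import codecs
import os
import tempfile
from collections import Counter

# Hypothetical stand-in for stopwords.words('english').
# A set makes each membership test fast, which matters for millions of tweets.
STOP_WORDS = {u"the", u"a", u"is", u"and", u"to"}


def top_words(path, n, stop_words=STOP_WORDS):
    """Return the n most common non-stop words in a UTF-8 text file."""
    counts = Counter()
    # codecs.open decodes each line to unicode text as it is read,
    # so the file is never held in memory all at once.
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            # Naive whitespace split; an assumption, not real CSV parsing.
            for word in line.lower().split():
                if word not in stop_words:
                    counts[word] += 1
    return counts.most_common(n)


# Write a tiny invented sample file so the sketch runs on its own.
sample = u"the caf\u00e9 is open\ncaf\u00e9 and tea\ntea to go\ncaf\u00e9\n"
fd, sample_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with codecs.open(sample_path, "w", "utf-8") as f:
    f.write(sample)

result = top_words(sample_path, 2)
os.remove(sample_path)
print(result)
```

For your case you would call top_words with your own path and 2000, and swap in the NLTK stop word list (converted to a set once, outside the loop, rather than calling stopwords.words on every comparison).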
Upvotes: 1