Reading csv file, remove stop words, find unique words

Question

I am trying to read a CSV file that contains 3 million tweets. Eventually, I want to remove the stop words and get the top 2,000 unique words along with their frequencies. However, I am running into an error well before I get to that point. Here is my code:

import nltk
from nltk.corpus import stopwords
import csv

f = open("/Users/shannonmcgregor/Desktop/ShanTweets.csv")
shannon_sample_tweets = f.read()
f.close()

filtered_tweets = [w for w in shannon_sample_tweets if not w in stopwords.words('english')]

And the error I get after I run that is:

__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Can anyone help me figure out what is going wrong? I did put # -*- coding: utf-8 -*- at the top of my source file.

duhaime · Accepted Answer

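Good, your comment clears things up. f.read() is returning a byte string, and the warning appears when non-ASCII bytes from it are compared against unicode stop words. To get your CSV into unicode, open it with codecs and an explicit encoding:

import codecs
f = codecs.open("/Users/shannonmcgregor/Desktop/ShanTweets.csv", "r", "utf-8")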

If you then recheck the type of what f.read() returns, you should see unicode. This assumes your tweets are valid UTF-8, which appeared to be the case (I took a quick peek!). If you plan on working with strings in Python, I recommend reading up on encodings; they will become important for your work.
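Once the decoding is sorted out, note that iterating over the string returned by f.read() yields single characters, not words, so the list comprehension in the question would not filter stop words even without the warning. Here is a minimal sketch of the rest of the task, assuming Python 3 (where text files decode to unicode when opened with an encoding). The function name count_words, the text_column parameter, and the regex tokenizer are my own placeholders and not from your code, and the tokenizer is deliberately crude (it does not treat URLs, @mentions, or hashtags specially).

```python
import csv
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters and apostrophes.
TOKEN_RE = re.compile(r"[\w']+")


def count_words(lines, stop_words, text_column=0):
    """Count non-stop-word tokens in one column of CSV input.

    `lines` is any iterable of text lines, for example a file opened with
    open(path, encoding="utf-8", newline="").
    `text_column` is an assumption: the index of the column holding the tweet.
    """
    stop_words = set(stop_words)  # set lookups avoid rescanning a list per word
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) <= text_column:
            continue  # skip blank or short rows
        for token in TOKEN_RE.findall(row[text_column].lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts


# Tiny in-memory sample standing in for the real file.
sample = ['"the cat and the hat"', '"The cat sat"', '"café cat"']
top = count_words(sample, stop_words={"the", "and"}).most_common(1)
print(top)
```

For the real file you would pass the open file object as lines, stopwords.words('english') as stop_words, and call .most_common(2000) on the result. Reading row by row also avoids holding all 3 million tweets in memory at once, and building the stop word set once avoids calling stopwords.words('english') for every item.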
