Reputation: 81
Currently I am working on a project which analyses Twitter data. I am in the pre-processing stage and am struggling to get my application to remove stop words from the dataset.
import pandas as pd
import json
import re
import nltk
from tkinter import filedialog
from nltk.corpus import stopwords

nltk.download('stopwords')

self.file_name = filedialog.askopenfilename(initialdir='/Desktop',
                                            title='Select file',
                                            filetypes=(('csv file', '*.csv'),
                                                       ('csv file', '*.csv')))

column_list = ["txt"]
clean_tw = []
df = pd.read_csv(self.file_name, usecols=column_list)
for tw in df["txt"]:
    stop_words = set(stopwords.words('english'))
    tw = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    if tw not in stop_words:  # tw is a list here, so this raises the TypeError below
        filtered_tw = [w for w in tw if not w in stopwords.words('english')]
        clean_tw.append(filtered_tw)
I currently get the error:
Exception in Tkinter callback
Traceback (most recent call last):
  File "...", line 1884, in __call__
    return self.func(*args)
  File "...", line 146, in clean_csv
    if tweet not in stop_words:
TypeError: unhashable type: 'list'
Upvotes: 1
Views: 2867
Reputation: 1284
Just an FYI, you should not be removing stopwords with regex when there are such great packages out there! I recommend using nltk to tokenize and untokenize.
For each row in your csv:
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')  # nltk.word_tokenize needs the punkt tokenizer data

# get your stopwords from nltk
stop_words = set(stopwords.words('english'))

# loop through your rows
for sent in sents:
    # tokenize
    tokenized_sent = nltk.word_tokenize(sent)
    # remove stops
    tokenized_sent_no_stops = [
        tok for tok in tokenized_sent
        if tok not in stop_words
    ]
    # untokenize
    untokenized_sent = TreebankWordDetokenizer().detokenize(
        tokenized_sent_no_stops
    )
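If your rows come from a pandas DataFrame as in the question, you can wrap those steps in a function and apply it to the column. A minimal sketch, assuming the column is named txt as in your post ("tweets.csv" is a placeholder file name):

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
detok = TreebankWordDetokenizer()

def remove_stops(sent):
    # tokenize, drop stopwords (case-insensitively), then rebuild the sentence
    tokens = nltk.word_tokenize(sent)
    return detok.detokenize([t for t in tokens if t.lower() not in stop_words])

df = pd.read_csv("tweets.csv", usecols=["txt"])  # placeholder file name
df["clean_txt"] = df["txt"].apply(remove_stops)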
Upvotes: 1
Reputation: 327
You are trying to check if a list (the result of the regex plus split()) is in a set... this fails because lists are unhashable, so they cannot be tested for set membership. You need to loop through the list (or do some sort of set operation, e.g. set(tw).difference(stop_words)).
Just for clarity:
>>> initial = "This is an example!"
>>> tw = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", initial.lower()).split()
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}
Then just append the difference to clean_tw :) Something like:
clean_tw = []
df = pd.read_csv(self.file_name, usecols=col_list)
stop_words = set(stopwords.words('english'))

for tw in df["txt"]:
    tw = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    clean_tw.append(set(tw).difference(stop_words))
Lastly, note that stop_words is defined outside the loop (unlike in your code), since it is always the same set; that way you gain a bit of performance :)
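One caveat: set(tw).difference(stop_words) returns a set, so duplicates and word order within each tweet are lost. If you need to keep both, loop through the list instead, as mentioned above. A minimal sketch under the same assumptions ("tweets.csv" is a placeholder):

import re
import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
clean_tw = []
df = pd.read_csv("tweets.csv", usecols=["txt"])  # placeholder file name
for tw in df["txt"]:
    tokens = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    # the list comprehension keeps word order and duplicates,
    # which set(tokens).difference(stop_words) would discard
    clean_tw.append([w for w in tokens if w not in stop_words])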
Upvotes: 1
Reputation: 23773
Based on the error message, it is probable that tweet is a list and stop_words is a set or dictionary.
>>> tweet = ['a','b']
>>> stop_words = set('abcdefg')
>>> tweet not in stop_words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Try this instead:

if not stop_words.intersection(tweet):
    ...

or

if stop_words.isdisjoint(tweet):
    ...
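In the context of the loop from the question, that check keeps only tweets containing no stopwords at all. A minimal sketch, assuming tokenized_tweets is a hypothetical iterable of word lists like the ones produced by re.sub(...).split():

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

clean_tw = []
for tweet in tokenized_tweets:  # hypothetical: an iterable of word lists
    # isdisjoint returns True when tweet shares no words with stop_words;
    # it stops at the first common word instead of building a new set
    if stop_words.isdisjoint(tweet):
        clean_tw.append(tweet)

stop_words.isdisjoint(tweet) is logically equivalent to not stop_words.intersection(tweet), but avoids constructing the intermediate intersection set.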
Upvotes: 0