Rose

Reputation: 81

How to remove stop words from a csv file

Currently I am working on a project which analyses Twitter data. I am in the pre-processing stage and am struggling to get my application to remove stop words from the dataset.

import pandas as pd
import json
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

self.file_name = filedialog.askopenfilename(initialdir='/Desktop',
                                                        title='Select file',
                                                        filetypes=(('csv file', '*.csv'),
                                                                   ('csv file', '*.csv')))

column_list = ["txt"]
clean_tw = []
df = pd.read_csv(self.file_name, usecols=column_list)
stop_words = set(stopwords.words('english'))

for tw in df["txt"]:
    tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    if tw not in stop_words:
        filtered_tw = [w for w in tw if w not in stopwords.words('english')]
        clean_tw.append(filtered_tw)

I currently get the error:

Exception in Tkinter callback
Traceback (most recent call last):
  File "...", line 1884, in __call__
    return self.func(*args)
  File "...", line 146, in clean_csv
    if tweet not in stop_words:
TypeError: unhashable type: 'list'

Upvotes: 1

Views: 2867

Answers (3)

Matt

Reputation: 1284

Just an FYI, you should not be removing stop words with regex when there are such great packages out there!

I recommend using nltk to tokenize and untokenize.

For each row in your csv:

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')  # required by nltk.word_tokenize

# get your stopwords from nltk
stop_words = set(stopwords.words('english'))

clean_sents = []

# loop through your rows
for sent in sents:

    # tokenize
    tokenized_sent = nltk.word_tokenize(sent)

    # remove stops (the stopword list is lowercase, so compare lowercased tokens)
    tokenized_sent_no_stops = [
        tok for tok in tokenized_sent
        if tok.lower() not in stop_words
    ]

    # untokenize
    untokenized_sent = TreebankWordDetokenizer().detokenize(
        tokenized_sent_no_stops
    )

    clean_sents.append(untokenized_sent)
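
If your rows live in the pandas DataFrame from the question, the same steps can be wrapped in a function and applied per row. A minimal sketch (the helper name remove_stops and the file name 'tweets.csv' are illustrative; the "txt" column follows the question):

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
detok = TreebankWordDetokenizer()

def remove_stops(sent):
    # tokenize, drop stop words, then rebuild the sentence
    tokens = nltk.word_tokenize(sent)
    return detok.detokenize([t for t in tokens if t.lower() not in stop_words])

df = pd.read_csv('tweets.csv', usecols=['txt'])   # 'tweets.csv' is a placeholder
df['clean_txt'] = df['txt'].apply(remove_stops)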

Upvotes: 1

ppanero

Reputation: 327

You are trying to check whether a list (the result of the regex split) is a member of a set. A list is unhashable, so that membership test raises a TypeError. You need to loop through the list (or do some sort of set operation, e.g. set(tw).difference(stop_words)).

Just for clarity:

>>> initial = "This is an example!"
>>> tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", initial.lower()).split()
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}

Then just append the difference to clean_tw :) Something like:

clean_tw = []
df = pd.read_csv(self.file_name, usecols=column_list)
stop_words = set(stopwords.words('english'))
tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
clean_tw.append(set(tw).difference(stop_words))

Lastly, you can define stop_words outside the loop, since it is always going to be the same set; that way you improve a bit on performance :)
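
Putting it all together, a minimal sketch of the loop with stop_words hoisted out (keeping the question's variable names, and assuming this runs where self.file_name is set):

import re
import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))   # built once, outside the loop

column_list = ["txt"]
df = pd.read_csv(self.file_name, usecols=column_list)

clean_tw = []
for tw in df["txt"]:
    tw = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    # the set difference drops stop words, but loses word order and duplicates
    clean_tw.append(set(tw).difference(stop_words))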

Upvotes: 1

wwii

Reputation: 23773

Based on the error message, it is probable that tweet is a list and stop_words is a set or dictionary.

>>> tweet = ['a','b']
>>> stop_words = set('abcdefg')
>>> tweet not in stop_words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Try this instead

if not stop_words.intersection(tweet):
    ...

or

if stop_words.isdisjoint(tweet):
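
Note that both checks only tell you whether the tweet is free of stop words; they do not remove anything. A quick illustration (assuming the nltk stop word set used in the question):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tweet = ['this', 'is', 'an', 'example']

stop_words.isdisjoint(tweet)               # False: 'this', 'is', 'an' are stop words
not stop_words.intersection(tweet)         # False, the equivalent check

[w for w in tweet if w not in stop_words]  # ['example'] - filters the stop words out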

Upvotes: 0
