Remove punctuation and stop words from a data frame

Question

My data frame looks like this:

State                           text
Delhi                  170 kw for330wp, shipping and billing in delhi...
Gujarat                4kw rooftop setup for home Photovoltaic Solar...
Karnataka              language barrier no requirements 1kw rooftop ...
Madhya Pradesh         Business PartnerDisqualified Mailed questionna...
Maharashtra            Rupdaypur, panskura(r.s) Purba Medinipur 150kw...

I want to remove punctuation and stop words from this data frame. I wrote the following code, but it's not working:

import collections
import re
import string

import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline

nltk.download('stopwords')

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)

This raises:

AttributeError: 'set' object has no attribute 'words'

arnaud · Accepted Answer

Problem: I believe you have a name conflict for stopwords. There is probably a line somewhere in your notebook where you assign:

stopwords = stopwords.words("english")

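That would explain the issue: after such an assignment, the name stopwords refers to your variable and no longer to NLTK's stopwords corpus object, so stopwords.words(...) fails. Since the error mentions a 'set' object, the result was most likely wrapped in set(...) or assigned from some other set.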
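A minimal sketch of how this kind of shadowing leads to the error from the question. So that it runs without NLTK installed, a SimpleNamespace stands in for nltk.corpus.stopwords; the stand-in and its three words are assumptions for illustration only, and the rebinding line is only a guess at what the notebook contains.

```python
from types import SimpleNamespace

# Hypothetical stand-in for nltk.corpus.stopwords, so the sketch runs without NLTK.
stopwords = SimpleNamespace(words=lambda language: ["the", "and", "in"])

# The suspected culprit: rebinding the same name to the result of the call.
stopwords = set(stopwords.words("english"))

# From here on, `stopwords` is a plain set, so `.words` no longer exists.
try:
    stopwords.words("english")
except AttributeError as exc:
    message = str(exc)

print(message)
```

In the notebook itself, evaluating type(stopwords) in a cell should show whether the name has been rebound.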

Solution: Make things unambiguous:

First, assign the stop words to a variable with a distinct name (this is also faster than calling stopwords.words("english") for every word, btw):

from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))

Use that in your function:

Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split() 
    if word.lower() not in english_stop_words
]
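Putting it together, the whole function might look like the sketch below. To keep it runnable without NLTK or pandas, it takes the stop-word set as a parameter and defaults to a tiny hand-written set; EXAMPLE_STOP_WORDS and the stop_words parameter are assumptions for illustration, not part of the original code. It also swaps the character-by-character punctuation filter for str.translate, which should give the same result.

```python
import string

# Illustrative stand-in only; in the notebook you would pass
# english_stop_words = set(stopwords.words("english")) instead.
EXAMPLE_STOP_WORDS = {"and", "in", "for", "no", "the"}

# Translation table that deletes every character in string.punctuation.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def message_cleaning(message, stop_words=EXAMPLE_STOP_WORDS):
    """Return the words of `message` with punctuation and stop words removed."""
    no_punctuation = message.translate(PUNCTUATION_TABLE)
    return [
        word for word in no_punctuation.split()
        if word.lower() not in stop_words
    ]


print(message_cleaning("170 kw for330wp, shipping and billing in delhi"))
```

With the real set, this would be applied as df['text'].apply(message_cleaning, stop_words=english_stop_words). Note that the function returns a list of words per row; use ' '.join(...) on the result if you want a string back.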
