Reputation: 85
My data frame looks like -
State text
Delhi 170 kw for330wp, shipping and billing in delhi...
Gujarat 4kw rooftop setup for home Photovoltaic Solar...
Karnataka language barrier no requirements 1kw rooftop ...
Madhya Pradesh Business PartnerDisqualified Mailed questionna...
Maharashtra Rupdaypur, panskura(r.s) Purba Medinipur 150kw...
I want to remove punctuation and stop words from this data frame. I have done the following code. But its not working -
import nltk
nltk.download('stopwords')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
% matplotlib inline
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
import re
def message_cleaning(message):
Test_punc_removed = [char for char in message if char not in string.punctuation]
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
return Test_punc_removed_join_clean
df['text'] = df['text'].apply(message_cleaning)
AttributeError: 'set' object has no attribute 'words'
Upvotes: 1
Views: 1195
Reputation: 3483
Problem: I believe you have a name conflict for stopwords
. There is probably a line somewhere in your notebook where you assign:
stopwords = stopwords.words("english")
That would explain the issue, as calling stopwords
would turn ambiguous: you'd be referring to the variable and not the package anymore.
Solution: Make things unambiguous:
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))
Test_punc_removed_join_clean = [
word for word in Test_punc_removed_join.split()
if word.lower() not in english_stop_words
]
Upvotes: 1