lucy

Reputation: 4506

Remove stopwords from dataframe

dataframe['Text'] = dataframe['Text'].apply(lambda x : ' '.join([item for item in string.split(x.lower()) if item not in stopwords]))

I am removing stop words from a DataFrame column. The logic works fine, but it raises an error when it reaches a row where Text is empty.

I tried dropna(), but it drops the whole row even when the other columns contain data.

How can I add a condition to the logic above so that it only applies when the Text column is not null?

Upvotes: 1

Views: 1798

Answers (2)

jezrael

Reputation: 863266

You can replace NaN with an empty list, although this is not straightforward. Use mask or combine_first with a Series of empty lists:

import pandas as pd

pos_tweets = [('I love this car', 'positive'),
('This view is amazing', 'positive'),
('I feel great this morning', 'positive'),
('I am so excited about the concert', 'positive'),
(None, 'positive')] 

df = pd.DataFrame(pos_tweets, columns= ["Text","col2"])
print (df)
                                Text      col2
0                    I love this car  positive
1               This view is amazing  positive
2          I feel great this morning  positive
3  I am so excited about the concert  positive
4                               None  positive

stopwords =  ['love','car','amazing']
# one empty list per row; the data length must match the index length
s = pd.Series([[] for _ in df.index], index=df.index)
df["Text"] = df["Text"].str.lower().str.split().mask(df["Text"].isnull(), s)
print (df)
                                        Text      col2
0                       [i, love, this, car]  positive
1                  [this, view, is, amazing]  positive
2            [i, feel, great, this, morning]  positive
3  [i, am, so, excited, about, the, concert]  positive
4                                         []  positive

df['Text']=df['Text'].apply(lambda x:' '.join([item for item in x if item not in stopwords]))
print (df)
                                Text      col2
0                             i this  positive
1                       this view is  positive
2          i feel great this morning  positive
3  i am so excited about the concert  positive
4                                     positive

Another solution:

stopwords =  ['love','car','amazing']
df["Text"]=df["Text"].str.lower().str.split().combine_first(pd.Series([[] for _ in df.index], index=df.index))
print (df)
                                        Text      col2
0                       [i, love, this, car]  positive
1                  [this, view, is, amazing]  positive
2            [i, feel, great, this, morning]  positive
3  [i, am, so, excited, about, the, concert]  positive
4                                         []  positive
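
For comparison, here is a minimal sketch that skips the Series of empty lists and instead guards against missing values inside the applied function. The helper name remove_stopwords and the two-row sample frame are assumptions made for illustration, not part of the original code:

```python
import pandas as pd

stopwords = {'love', 'car', 'amazing'}

def remove_stopwords(text):
    # Hypothetical helper: missing values (None/NaN) are not strings,
    # so they are turned into an empty string instead of raising an error.
    if not isinstance(text, str):
        return ''
    return ' '.join(word for word in text.lower().split() if word not in stopwords)

# Assumed sample data: one normal row and one row with a missing Text value.
df = pd.DataFrame({'Text': ['I love this car', None],
                   'col2': ['positive', 'positive']})
df['Text'] = df['Text'].apply(remove_stopwords)
print(df)
```

Every row is kept, so the data in col2 survives for rows where Text was missing.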

Upvotes: 1

Kishore

Reputation: 5891

Use this before your logic. dropna returns a new DataFrame, so assign the result:

dataframe = dataframe.dropna(subset=['Text'], how='all')
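
A small sketch of what dropna with subset does, using assumed sample data. Only missing values in Text are checked, but each affected row is removed entirely, including its values in the other columns:

```python
import pandas as pd

# Assumed sample data: the second row has no Text but does have col2.
df = pd.DataFrame({'Text': ['I love this car', None],
                   'col2': ['positive', 'positive']})

# dropna does not modify df in place by default; keep the returned frame.
cleaned = df.dropna(subset=['Text'])
print(len(df), len(cleaned))
```

If the rows with missing Text must be kept, this approach does not fit, and the missing values need to be replaced or skipped instead.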

Upvotes: 1
