Reputation: 319
I have a dataframe containing many rows of strings, `btb['Title']`. I would like to identify whether each string contains positive, negative, or neutral keywords. The following works but is considerably slow:
positive_kw =('rise','positive','high','surge')
negative_kw = ('sink','lower','fall','drop','slip','loss','losses')
neutral_kw = ('flat','neutral')
# create new columns; set the value to 1 if a keyword exists in the sentence
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan

for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw):
        btb['Positive'].loc[index] = 1
    if any(s in row.Title for s in negative_kw):
        btb['Negative'].loc[index] = 1
    if any(s in row.Title for s in neutral_kw):
        btb['Neutral'].loc[index] = 1
I appreciate your time and am interested to see what is necessary to improve the performance of this code.
Upvotes: 4
Views: 3783
Reputation: 354
You can use `'|'.join` on a list of words to create a regex pattern that matches any of the words (at least one). Then you can pass that pattern to the `pandas.Series.str.contains()` method to create a boolean mask for the matches.
import pandas as pd

# create a regex pattern out of each list of words
positive_kw = '|'.join(['rise', 'positive', 'high', 'surge'])
negative_kw = '|'.join(['sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses'])
neutral_kw = '|'.join(['flat', 'neutral'])
# create some fake data for demonstration
words = [
    'rise high',
    'positive attitude',
    'something',
    'foo',
    'lowercase',
    'flat earth',
    'neutral opinion',
]
df = pd.DataFrame(data=words, columns=['words'])

# str.contains returns a boolean Series; cast to int for 0/1 columns
df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)
print(df)
Output:
               words  positive  negative  neutral
0          rise high         1         0        0
1  positive attitude         1         0        0
2          something         0         0        0
3                foo         0         0        0
4          lowercase         0         1        0
5         flat earth         0         0        1
6    neutral opinion         0         0        1
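Note that this matches substrings, just like the original `in` check: in the output above, `'lowercase'` is flagged as negative because it contains `'lower'`. If you want whole-word matches only, one option (a sketch, not part of the original answer) is to wrap the joined keywords in `\b` word boundaries and escape them with `re.escape` in case any keyword contains regex metacharacters:

```python
import re

import pandas as pd

# build a pattern that only matches whole words, e.g. r'\b(?:sink|lower|fall)\b'
negative_kw = r'\b(?:' + '|'.join(map(re.escape, ['sink', 'lower', 'fall'])) + r')\b'

df = pd.DataFrame(data=['lowercase', 'lower bound'], columns=['words'])
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
print(df)
#          words  negative
# 0    lowercase         0
# 1  lower bound         1
```

With the boundaries in place, `'lowercase'` no longer counts as a match while `'lower bound'` still does.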
Upvotes: 5