lsama
lsama

Reputation: 319

More efficient way to find multiple keywords in column of strings pandas

I have a dataframe containing many rows of strings: btb['Title']. I would like to identify whether each string contains positive, negative or neutral keywords. The following works but is considerably slow:

positive_kw =('rise','positive','high','surge')
negative_kw = ('sink','lower','fall','drop','slip','loss','losses')
neutral_kw = ('flat','neutral')
#create new columns, turn value to one if keyword exists in sentence
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan

#Turn value to one if keyword exists in sentence
for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw) == True:
        btb['Positive'].loc[index] = 1
    if any(s in row.Title for s in negative_kw) == True:
        btb['Negative'].loc[index] = 1
    if any(s in row.Title for s in neutral_kw) == True:
        btb['Neutral'].loc[index] = 1

I appreciate your time and am intested to see what is necessary to improve the performance of this code

Upvotes: 4

Views: 3783

Answers (1)

MattR0se
MattR0se

Reputation: 354

You can use '|'.join on a list of words to create a regex pattern which matches any of the words (at least one) Then you can use the pandas.Series.str.contains() method to create a boolean mask for the matches.

import pandas as pd

# create regex pattern out of the list of words
positive_kw = '|'.join(['rise','positive','high','surge'])
negative_kw = '|'.join(['sink','lower','fall','drop','slip','loss','losses'])
neutral_kw = '|'.join(['flat','neutral'])

# creating some fake data for demonstration
words = [
        'rise high',
        'positive attitude',
        'something',
        'foo',
        'lowercase',
        'flat earth',
        'neutral opinion'
        ]

df = pd.DataFrame(data=words, columns=['words'])

df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)

print(df)

Output:

               words  positive  negative  neutral
0          rise high         1         0        0
1  positive attitude         1         0        0
2          something         0         0        0
3                foo         0         0        0
4          lowercase         0         1        0
5         flat earth         0         0        1
6    neutral opinion         0         0        1

Upvotes: 5

Related Questions