Jimmy

Reputation: 182

Pandas find all words from row in dataframe match with list

I have a dict of emotions (anger, fear, anticipation, trust, etc.) with the words associated with each emotion:

anticipationlist:

{'anticipation': ['abundance',
          'opera',
          'star',
          'start',
          'achievement',
          'acquiring',...]

I also have a dataframe with rows of sentences. I want to find the words in each sentence that are associated with the emotion.

| text                          |
|---------------------------    |
| operation start yesterday     |
| operation start now           |
| operation achievement         |

Expected output

| text                          | result        |
|---------------------------    |-------------  |
| operation start yesterday     | start         |
| operation start now           | start         |
| operation achievement         | achievement   |

I tried

df['result']=df["text"].str.findall(r"\b"+"|".join(anticipationlist) +r"\b").apply(", ".join)

My result is

| text                          | result                |
|---------------------------    |--------------------   |
| operation start yesterday     | opera, star           |
| operation start now           | opera, star           |
| operation achievement         | opera, achievement    |

How can I improve my code so that only whole words are matched and I get the expected output?

Upvotes: 1

Views: 875

Answers (2)

James Tollefson

Reputation: 893

Here's an approach that doesn't use regex. Also, I changed your anticipationlist from a dict to a list.

import pandas as pd

anticipationlist= ['abundance',
                    'opera',
                    'star',
                    'start',
                    'achievement',
                    'acquiring',
                    ]

values = [
    'operation start yesterday',
    'operation start now',
    'operation achievement',
    ]
df = pd.DataFrame(data=values, columns=['text'])

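def find_values(x):
    # Keep only words that exactly equal a list entry,
    # so "operation" does not match "opera".
    results = []
    for value in anticipationlist:
        for word in x.split():
            if word == value:
                results.append(word)
    return ' '.join(results)

# Pass the function directly; wrapping it in a lambda is unnecessary.
df['result'] = df['text'].apply(find_values)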

print(df.head())
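
A variation on the loop above, offered only as a minimal sketch: it checks each word against a set instead of scanning the whole list, keeps matches in sentence order, and joins them with ", " as in the question's expected output. The names `anticipation_words` and `find_emotion_words` are hypothetical, and the word list is the shortened one from the question.

```python
import pandas as pd

# Hypothetical, shortened word list copied from the question.
anticipation_words = {'abundance', 'opera', 'star', 'start',
                      'achievement', 'acquiring'}

df = pd.DataFrame({'text': [
    'operation start yesterday',
    'operation start now',
    'operation achievement',
]})

def find_emotion_words(sentence):
    # Whole-word comparison: "operation" is not in the set, so it is skipped
    # even though it starts with "opera".
    return ', '.join(w for w in sentence.split() if w in anticipation_words)

df['result'] = df['text'].apply(find_emotion_words)
print(df)
```

Splitting on whitespace leaves punctuation attached to words (for example "start," would not match), so real text would probably need some cleaning first.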

Upvotes: 0

jezrael

Reputation: 862406

You can add word boundaries to each value separately:

pat = '|'.join(r"\b{}\b".format(x) for x in anticipationlist)
df['result']=df["text"].str.findall(pat).apply(", ".join)

print(df)
                        text       result
0  operation start yesterday        start
1        operation start now        start
2      operation achievement  achievement
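
For context on why the pattern in the question fails: in `\b` + `a|b|c` + `\b`, the leading `\b` binds only to the first alternative and the trailing `\b` only to the last, so `opera` and `star` in the middle can match inside longer words. An equivalent fix, sketched here under the assumption that `anticipationlist` is a plain list of words, is to wrap the alternatives in one non-capturing group. The `re.escape` call is an extra precaution not present in the code above, and the names `broken` and `pat` are just illustrative.

```python
import re
import pandas as pd

# Assumed to be a plain list (shortened, as in the question).
anticipationlist = ['abundance', 'opera', 'star', 'start',
                    'achievement', 'acquiring']

df = pd.DataFrame({'text': [
    'operation start yesterday',
    'operation start now',
    'operation achievement',
]})

# Pattern from the question: the \b anchors apply only to the first
# and last alternatives.
broken = r"\b" + "|".join(anticipationlist) + r"\b"
print(re.findall(broken, 'operation start yesterday'))

# Grouped pattern: both \b anchors now apply to every alternative.
# re.escape guards against words that contain regex metacharacters.
pat = r"\b(?:" + "|".join(re.escape(w) for w in anticipationlist) + r")\b"
df['result'] = df['text'].str.findall(pat).apply(", ".join)
print(df)
```

With a long word list, the single group keeps the pattern shorter than repeating `\b` around every word.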

Upvotes: 1
