connor449

Reputation: 1679

Looping through list and row for keyword match in pandas dataframe

I have a dataframe that looks like this. It has one column, labeled 'utterances'. Each row of df.utterances holds a string of n words.

  
                             utterances
0                                        okay go ahead.
1                                     Um, let me think.
2     nan that's not very encouraging. If they had a...
3     they wouldn't make you want to do it. nan nan ...
4     Yeah. The problem is though, it just, if we pu...

I also have a list of specific words. It is called specific_words. It looks like this:

specific_words = ['happy', 'good', 'encouraging', 'joyful']

I want to check whether any of the words from specific_words are found in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.

def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterance:
            if word in utt:
                match = True
            else:
                match = False
            return match
    
df['query_match'] = df.apply(query_text_by_keyword, 
                                               axis=1, 
                                               args=(specific_words,))

It doesn't break, but it just returns False for every row, when it shouldn't. For example, the first few rows should look like this:

                                              utterances  query_match
0                                         okay go ahead.        False
1                                      Um, let me think.        False
2      nan that's not very encouraging. If they had a...         True
3      they wouldn't make you want to do it. nan nan ...        False
4      Yeah. The problem is though, it just, if we pu...        False

Edit

@furas made a great suggestion to solve the initial question. However, I would also like to add another column that contains the specific word(s) from the query that indicates a match. Example:

                                          utterances  query_match           word
0                                      okay go ahead        False            NaN
1                                   Um, let me think        False            NaN
2   nan that's not very encouraging. If they had a..         True  'encouraging'
3   they wouldn't make you want to do it. nan nan ..        False            NaN
4   Yeah. The problem is though, it just, if we pu..        False            NaN

Upvotes: 0

Views: 2209

Answers (1)

furas

Reputation: 142661

You can use regex with str.contains(regex)

df['utterances'].str.contains("happy|good|encouraging|joyful")

You can create this regex with

query = '|'.join(specific_words)
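
A plain '|'.join() also matches substrings (good inside goodbye) and treats any regex metacharacters in the words as regex syntax. If that matters for your data, a minimal sketch using re.escape and word boundaries could look like this. The names safe_query and whole_word_match and the sample Series are only illustrative, not part of the question:

```python
import re
import pandas as pd

specific_words = ['happy', 'good', 'encouraging', 'joyful']

# re.escape protects against regex metacharacters inside the words;
# \b ... \b limits matches to whole words, and (?:...) is a non-capturing group.
safe_query = r'\b(?:' + '|'.join(re.escape(w) for w in specific_words) + r')\b'

# Hypothetical sample data for illustration only
s = pd.Series(['that is good', 'goodbye then', 'so joyful!'])

whole_word_match = s.str.contains(safe_query)
print(whole_word_match.tolist())  # [True, False, True]
```

Whether you want whole-word matching depends on the use case; the simple join is fine if substring matches are acceptable.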

You can also use str.lower() because strings may have uppercase chars.

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.lower().str.contains(query)

print(df)

Result

                                          utterances  query_match
0                                      okay go ahead        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False

EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower()

df['query_match'] = df['utterances'].str.contains(query, case=False)

More in doc: pandas.Series.str.contains
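
If the column can hold real missing values (not just the text 'nan'), str.contains() also accepts an na argument that sets what is returned for them. A small sketch, assuming a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one real missing value
s = pd.Series(['not very encouraging', np.nan, 'okay go ahead'])
query = 'happy|good|encouraging|joyful'

# na=False makes the missing row count as "no match",
# so the result can be used directly as a boolean mask.
without_na = s.str.contains(query, case=False, na=False)
print(without_na.tolist())  # [True, False, False]
```

Without na=False the result for the missing row depends on the dtype and pandas version, so setting it explicitly is the safer choice when you filter with the column.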


EDIT: to get the matching word you can use str.extract() with the regex in a capture group (...)

df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )

Working example:

import pandas as pd

df = pd.DataFrame({
    'utterances':[
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
        'Yeah. happy good',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

query = '|'.join(specific_words)

df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )

print(df)

In the example I added 'Yeah. happy good' to test which word will be returned, happy or good. It returns the first matching word.

Result:

                                          utterances  query_match         word
0                                      okay go ahead        False          NaN
1                                  Um, let me think.        False          NaN
2  nan that's not very encouraging. If they had a...         True  encouraging
3  they wouldn't make you want to do it. nan nan ...        False          NaN
4  Yeah. The problem is though, it just, if we pu...        False          NaN
5                                   Yeah. happy good         True        happy
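
If you need every matching word and not only the first one, str.findall() may be an option. This is only a sketch: the column name words and the shortened sample data are my own, and flags=re.IGNORECASE is added so the word search ignores case the same way case=False does in str.contains():

```python
import re
import pandas as pd

# Shortened, hypothetical sample data
df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        'Yeah. Happy good',
        'not very encouraging',
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# findall returns a list with every match found in each row (empty list if none)
df['words'] = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# a row matches when its list of found words is not empty
df['query_match'] = df['words'].map(len) > 0

print(df)
```

Note that the matched text keeps its original case, so 'Happy' is returned as it appears in the utterance.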

BTW: now you can even do

df['query_match'] = ~df['word'].isna()

or

df['query_match'] = df['word'].notna()
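
For completeness: the function in the question has return match inside the inner loop, so it returns after the very first comparison and never checks the other words. If you prefer to keep a loop-style version, a possible rewrite (my assumption of what was intended, not the only way) applies the function to the column and uses any():

```python
import pandas as pd

def query_text_by_keyword(utterance, word_list):
    # Values that are not strings (e.g. real NaN) are treated as "no match"
    if not isinstance(utterance, str):
        return False
    text = utterance.lower()
    # True as soon as one of the words occurs in the utterance
    return any(word in text for word in word_list)

# Shortened, hypothetical sample data
df = pd.DataFrame({
    'utterances': [
        'okay go ahead.',
        "nan that's not very encouraging. If they had a...",
    ]
})

specific_words = ['happy', 'good', 'encouraging', 'joyful']

# apply on the column passes one string at a time to the function
df['query_match'] = df['utterances'].apply(query_text_by_keyword, args=(specific_words,))

print(df['query_match'].tolist())  # [False, True]
```

On large dataframes this will usually be slower than str.contains().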

Upvotes: 2
