Reputation: 1679
I have a dataframe with one column, labeled 'utterances'. Each row of df.utterances is a string of n words. It looks like this:
utterances
0 okay go ahead.
1 Um, let me think.
2 nan that's not very encouraging. If they had a...
3 they wouldn't make you want to do it. nan nan ...
4 Yeah. The problem is though, it just, if we pu...
I also have a list of specific words, called specific_words. It looks like this:
specific_words = ['happy', 'good', 'encouraging', 'joyful']
I want to check whether any of the words in specific_words appear in any of the utterances. Essentially, I want to loop through every row in df.utterances and, for each row, loop through specific_words to look for matches. If there is a match, I want a boolean column next to df.utterances that shows this.
def query_text_by_keyword(df, word_list):
    for word in word_list:
        for utt in df.utterances:
            if word in utt:
                match = True
            else:
                match = False
    return match

df['query_match'] = df.apply(query_text_by_keyword,
                             axis=1,
                             args=(specific_words,))
It doesn't break, but it just returns False for every row, when it shouldn't. For example, the first few rows should look like this:
                                          utterances  query_match
0                                     okay go ahead.        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False
@furas made a great suggestion to solve the initial question. However, I would also like to add another column that contains the specific word(s) from the query that indicates a match. Example:
                                         utterances  query_match           word
0                                     okay go ahead        False            NaN
1                                  Um, let me think        False            NaN
2  nan that's not very encouraging. If they had a..         True  'encouraging'
3  they wouldn't make you want to do it. nan nan ..        False            NaN
4  Yeah. The problem is though, it just, if we pu..        False            NaN
Upvotes: 0
Views: 2209
Reputation: 142661
You can use a regex with str.contains(regex):
df['utterances'].str.contains("happy|good|encouraging|joyful")
You can build this regex with
query = '|'.join(specific_words)
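One caveat with joining the words directly: if a word contains regex metacharacters (for example + or .), the joined pattern may fail or match something unintended. A minimal sketch, assuming a hypothetical word list that includes 'c++' (not part of the question), escapes each word with re.escape before joining:

```python
import re

# Hypothetical word list for illustration; 'c++' contains regex metacharacters.
specific_words = ['happy', 'good', 'c++']

# Escape each word so it is matched literally, then join the alternatives with '|'.
query = '|'.join(re.escape(word) for word in specific_words)
print(query)
```

For the plain words in the question this produces the same pattern as the unescaped join, so it is only a safeguard.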
You can also use str.lower(), because the strings may contain uppercase characters.
import pandas as pd
df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)
df['query_match'] = df['utterances'].str.lower().str.contains(query)
print(df)
Result
                                          utterances  query_match
0                                      okay go ahead        False
1                                  Um, let me think.        False
2  nan that's not very encouraging. If they had a...         True
3  they wouldn't make you want to do it. nan nan ...        False
4  Yeah. The problem is though, it just, if we pu...        False
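For comparison, here is a sketch of an apply-based version closer to the original attempt. The likely reason the original returned False everywhere is that df.apply(..., axis=1) passes one row at a time, so the inner loop iterates over the characters of a single string instead of over utterances. The helper name contains_any and the shortened sample data are assumptions made for illustration:

```python
import pandas as pd

# Shortened sample data, assumed for illustration.
df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        "that's not very encouraging",
        'Um, let me think.',
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']

def contains_any(text, word_list):
    # True if any word occurs as a substring of the text (case-insensitive).
    text = str(text).lower()
    return any(word in text for word in word_list)

# Apply to the column (a Series), so the function receives one string per call.
df['query_match'] = df['utterances'].apply(contains_any, args=(specific_words,))
print(df)
```

The vectorized str.contains version is usually the simpler choice; this is only meant to show what the loop needed to look like.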
EDIT: as @HenryYik mentioned in a comment, you can use case=False instead of str.lower():
df['query_match'] = df['utterances'].str.contains(query, case=False)
More in doc: pandas.Series.str.contains
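Note that a plain alternation matches substrings, so 'good' would also match inside 'Goodbye'. If whole-word matches are wanted, one option is to wrap the pattern in word boundaries. This is a sketch with made-up sample rows, not something the question asked for:

```python
import re
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({'utterances': ['Goodbye then', 'That is good.', 'okay go ahead']})
specific_words = ['happy', 'good', 'encouraging', 'joyful']

# \b marks a word boundary; (?:...) groups the alternatives without capturing.
query = r'\b(?:' + '|'.join(re.escape(w) for w in specific_words) + r')\b'

df['query_match'] = df['utterances'].str.contains(query, case=False)
print(df)
```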
EDIT: to get the matching word, you can use str.extract() with the regex wrapped in a capture group (...):
df['word'] = df['utterances'].str.extract( "(happy|good|encouraging|joyful)" )
Working example:
import pandas as pd
df = pd.DataFrame({
    'utterances': [
        'okay go ahead',
        'Um, let me think.',
        'nan that\'s not very encouraging. If they had a...',
        'they wouldn\'t make you want to do it. nan nan ...',
        'Yeah. The problem is though, it just, if we pu...',
        'Yeah. happy good',
    ]
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)
df['query_match'] = df['utterances'].str.contains(query, case=False)
df['word'] = df['utterances'].str.extract( '({})'.format(query) )
print(df)
In the example I added 'Yeah. happy good' to test which word is returned, happy or good. It returns the first matching word.
Result:
                                          utterances  query_match         word
0                                      okay go ahead        False          NaN
1                                  Um, let me think.        False          NaN
2  nan that's not very encouraging. If they had a...         True  encouraging
3  they wouldn't make you want to do it. nan nan ...        False          NaN
4  Yeah. The problem is though, it just, if we pu...        False          NaN
5                                   Yeah. happy good         True        happy
BTW: now you can even do
df['query_match'] = ~df['word'].isna()
or
df['query_match'] = df['word'].notna()
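Since the question asks for the matching word(s), a sketch using str.findall() may also be useful: it collects every match in a row rather than only the first. Passing flags=re.IGNORECASE keeps it consistent with case=False, because str.extract() and str.findall() are case-sensitive by default. The column name words and the sample rows are assumptions for illustration:

```python
import re
import pandas as pd

# Sample rows assumed for illustration.
df = pd.DataFrame({
    'utterances': ['okay go ahead', 'Yeah. Happy good', 'not very encouraging']
})
specific_words = ['happy', 'good', 'encouraging', 'joyful']
query = '|'.join(specific_words)

# findall returns a list of all matches per row (an empty list when there are none).
df['words'] = df['utterances'].str.findall(query, flags=re.IGNORECASE)

# A row matches when its list of found words is non-empty.
df['query_match'] = df['words'].str.len() > 0
print(df)
```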
Upvotes: 2