Basudev

Reputation: 135

How to get the specific word from str.contains

I have a pandas DataFrame with an ID column and a text column. I am categorizing the records with str.contains, and I need the words that str.contains matched in each text string, placed in separate columns. I am using Python 3 and pandas. My df is as follows:

ID  Text
1   The cricket world cup 2019 has begun
2   I am eagrly waiting for the cricket worldcup 2019 
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019
4   I love cricket to watch and badminton to play


searchfor = ['cricket', 'world cup', '2019']
df['Text'].str.contains('|'.join(searchfor))

Expected output:

ID  Text                                                                                       phrase1  phrase2    phrase3
1   The cricket world cup 2019 has begun                                                       cricket  world cup  2019
2   I am eagrly waiting for the cricket worldcup 2019                                          cricket  world cup  2019
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019  cricket  world cup  2019
4   I love cricket to watch and badminton to play                                              cricket

Upvotes: 0

Views: 100

Answers (2)

Piotr

Reputation: 2117

The trick is to use str.findall instead of str.contains to get the list of all matched phrases. Then it is just a matter of munging the dataframe to the format you want.

Here is your starting point:

import pandas as pd

df = pd.DataFrame(
    [
        'The cricket world cup 2019 has begun',
        'I am eagrly waiting for the cricket worldcup 2019',
        'I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019',
        'I love cricket to watch and badminton to play',
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"]
)
searchfor = ['cricket','world cup','2019']

And here is an example solution:

pattern = "(" + "|".join(searchfor) + ")"
matches = (
    df.Text.str.findall(pattern)
    .apply(pd.Series)
    .stack()
    .reset_index(-1, drop=True)
    .to_frame("phrase")
    .assign(match=True)
)

#        phrase  match
# ID                  
# 1     cricket   True
# 1   world cup   True
# 1        2019   True
# 2     cricket   True
# 2        2019   True
# 3     cricket   True
# 3        2019   True
# 4     cricket   True

You can also reformat the dataframe to have a separate column for each phrase:

matches.pivot(columns="phrase", values="match").fillna(False)

# phrase   2019  cricket  world cup
# ID                               
# 1        True     True       True
# 2        True     True      False
# 3        True     True      False
# 4       False     True      False
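
If you want the matched words themselves in positional columns (phrase1, phrase2, ...) as in the question, one possible sketch follows. It rebuilds the same df so it runs on its own; the names `phrases` and `result` and the use of `re.escape` are my own additions for illustration, not part of the solution above. Columns are filled in order of appearance, so a row without "world cup" gets 2019 in phrase2 rather than an empty slot.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=pd.Index(range(1, 5), name="ID"),
)
searchfor = ["cricket", "world cup", "2019"]

# re.escape guards against phrases that contain regex metacharacters
pattern = "|".join(re.escape(p) for p in searchfor)

# One list of matched substrings per row
found = df["Text"].str.findall(pattern)

# Spread each list across columns; shorter lists are padded with missing values
phrases = pd.DataFrame(found.tolist(), index=df.index)
phrases.columns = [f"phrase{i + 1}" for i in range(phrases.shape[1])]

result = df.join(phrases)
print(result)
```

Because "worldcup" and "cricketworldcup" do not contain the phrase "world cup" with a space, rows 2 and 3 only match cricket and 2019. Matching those variants would need a looser pattern, which is a separate decision about what should count as a match.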

Upvotes: 1

Mohit Motwani

Reputation: 4792

You can use np.where:

import numpy as np
search_for = ['cricket', 'world cup', '2019']

for word in search_for:
    df[word] = np.where(df.text.str.contains(word), word, np.nan)

df


     text                                               cricket    world cup    2019
1   The cricket world cup 2019 has begun                cricket    world cup    2019
2   I am eagrly waiting for the cricket worldcup 2019   cricket     nan         2019
3   I will try to watch all the mathes my favourit...   cricket     nan         2019
4   I love cricket to watch and badminton to play       cricket     nan         nan

Syntax of np.where: np.where(condition[, x, y]). It works element by element: where the condition is True it returns the value from x, otherwise the value from y.
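
As a small illustration of that elementwise behaviour, here is a sketch on a two-row Series (the data and the names `text`, `mask`, `picked` and `as_series` are made up for the example). It also shows `Series.where` as one possible alternative: the nan entries in the output above appear to be text, probably because NumPy coerces both branches to a common dtype, whereas `Series.where` fills real missing values.

```python
import numpy as np
import pandas as pd

text = pd.Series(
    [
        "The cricket world cup 2019 has begun",
        "I love cricket to watch and badminton to play",
    ]
)

# regex=False treats the search term as a literal substring
mask = text.str.contains("2019", regex=False)

# np.where picks per element: "yes" where mask is True, "no" otherwise
picked = np.where(mask, "yes", "no")
print(picked)

# Series.where keeps values where mask is True and puts a real missing value elsewhere
as_series = pd.Series("2019", index=text.index).where(mask)
print(as_series.isna())
```

With real missing values, methods such as isna, dropna and fillna behave as expected on the resulting columns.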

Upvotes: 1
