Reputation: 135
I have a pandas DataFrame with an ID column and a text column. I am trying to categorize the records with str.contains, and I need the words that str.contains matched in each text string to appear in separate columns. I am using Python 3 and pandas. My df is as follows:
ID Text
1 The cricket world cup 2019 has begun
2 I am eagrly waiting for the cricket worldcup 2019
3 I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019
4 I love cricket to watch and badminton to play
searchfor = ['cricket','world cup','2019']
df['Text'].str.contains('|'.join(searchfor))
Expected output:
ID  Text                                                                                       phrase1  phrase2    phrase3
1   The cricket world cup 2019 has begun                                                       cricket  world cup  2019
2   I am eagrly waiting for the cricket worldcup 2019                                          cricket  world cup  2019
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019  cricket  world cup  2019
4   I love cricket to watch and badminton to play                                              cricket
Upvotes: 0
Views: 100
Reputation: 2117
The trick is to use str.findall instead of str.contains to get the list of all matched phrases. Then it is just a matter of munging the dataframe into the format you want.
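To make the difference concrete, here is a minimal sketch on a two-row toy Series (the data and the names s, flags and found are made up for illustration): str.contains gives one boolean per row, while str.findall gives the list of substrings that matched.

```python
import pandas as pd

# Toy data for illustration only, not the asker's full frame.
s = pd.Series(["The cricket world cup 2019 has begun", "I love badminton"])
pattern = "cricket|world cup|2019"

flags = s.str.contains(pattern)  # one True/False per row
found = s.str.findall(pattern)   # one list of matched substrings per row

print(flags.tolist())
print(found.tolist())
```

A row with no match gets an empty list from str.findall rather than a missing value.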
Here is your starting point:
import pandas as pd

df = pd.DataFrame(
    [
        'The cricket world cup 2019 has begun',
        'I am eagrly waiting for the cricket worldcup 2019',
        'I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019',
        'I love cricket to watch and badminton to play',
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"],
)
searchfor = ['cricket', 'world cup', '2019']
And here is an example solution:
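pattern = "(" + "|".join(searchfor) + ")"
matches = (
    df.Text.str.findall(pattern)   # list of matched phrases per row
    .apply(pd.Series)              # spread each list across columns
    .stack()                       # back to long form: one row per (ID, position)
    .reset_index(-1, drop=True)    # discard the inner position level, keep ID
    .to_frame("phrase")
    .assign(match=True)
)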
#        phrase  match
# ID
# 1     cricket   True
# 1   world cup   True
# 1        2019   True
# 2     cricket   True
# 2        2019   True
# 3     cricket   True
# 3        2019   True
# 4     cricket   True
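If your pandas is recent enough to have Series.explode (0.25 or later), the same long table can probably be built without apply(pd.Series). This is only a sketch of an alternative, not the approach used above; matches_alt is a name chosen here, and re.escape is added on the assumption that search terms might one day contain regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame(
    {
        "Text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=pd.Index(range(1, 5), name="ID"),
)
searchfor = ["cricket", "world cup", "2019"]

# Escape each term so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(term) for term in searchfor)

matches_alt = (
    df["Text"].str.findall(pattern)
    .explode()            # one row per matched phrase, ID index repeated
    .dropna()             # rows whose list was empty come out as NaN
    .to_frame("phrase")
    .assign(match=True)
)
print(matches_alt)
```

Note that "world cup" is matched literally, so "worldcup" and "cricketworldcup" still do not count as a hit for that phrase.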
You can also reformat the dataframe to have a separate column for each phrase:
matches.pivot(columns="phrase", values="match").fillna(False)
# phrase   2019  cricket  world cup
# ID
# 1        True     True       True
# 2        True     True      False
# 3        True     True      False
# 4       False     True      False
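If you would rather have the phrase1, phrase2, ... layout from the question, one possible sketch is to turn the findall lists straight into columns. The names found, wide and result are chosen here for illustration. Phrases are filled left to right in the order they appear in the text, so a given column does not always hold the same search term.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=pd.Index(range(1, 5), name="ID"),
)
searchfor = ["cricket", "world cup", "2019"]
pattern = "|".join(searchfor)

# One list of matches per row; shorter lists are padded with missing values
# when the lists are turned into a rectangular frame.
found = df["Text"].str.findall(pattern)
wide = pd.DataFrame(found.tolist(), index=df.index)
wide.columns = [f"phrase{i + 1}" for i in range(wide.shape[1])]

result = df.join(wide)
print(result)
```

For ID 2 this puts "2019" in phrase2, because "worldcup" without a space does not match "world cup".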
Upvotes: 1
Reputation: 4792
You can use np.where:
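import numpy as np

search_for = ['cricket', 'world cup', '2019']
for word in search_for:
    # One new column per word: the word itself where the text contains it,
    # np.nan as the fallback elsewhere
    df[word] = np.where(df.text.str.contains(word), word, np.nan)
df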
                                                text  cricket  world cup  2019
1               The cricket world cup 2019 has begun  cricket  world cup  2019
2  I am eagrly waiting for the cricket worldcup 2019  cricket        nan  2019
3  I will try to watch all the mathes my favourit...  cricket        nan  2019
4      I love cricket to watch and badminton to play  cricket        nan   nan
Syntax of np.where: np.where(condition[, x, y]). Where the condition is True it returns x, otherwise y.
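A tiny illustration of that element-wise selection, followed by a variant of the loop. The missing entries in the output above print as lowercase nan, which suggests they may have ended up as the text 'nan' rather than real missing values. If that matters for you, one possible workaround (a swapped-in technique, not part of np.where) is Series.where, sketched here under the assumption that the column is called text; picked and hit are names made up for this example.

```python
import numpy as np
import pandas as pd

# np.where picks from x where the condition is True, from y elsewhere.
picked = np.where(np.array([True, False, True]), "yes", "no")
print(picked.tolist())

df = pd.DataFrame(
    {
        "text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=range(1, 5),
)
search_for = ["cricket", "world cup", "2019"]

for word in search_for:
    # regex=False treats the word as a plain substring.
    hit = df["text"].str.contains(word, regex=False)
    # Fill a column with the word, then blank out rows without a hit (NaN).
    df[word] = pd.Series(word, index=df.index).where(hit)

print(df)
```

With real missing values, checks such as isna() and dropna() behave as expected on the new columns.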
Upvotes: 1