Reputation: 563
We have the following two data frames
temp = pd.DataFrame(np.array([['I am feeling very well',1],['It is hard to believe this happened',0],
['What is love?',1], ['No new friends',0],
['I love this show',1],['Amazing day today',1]]),
columns = ['message','sentiment'])
temp_truncated = pd.DataFrame(np.array([['I am feeling very',1],['It is hard to believe',1],
['What is',1], ['Amazing day',1]]),
columns = ['message','cutoff'])
My idea is to create a third DataFrame that would represent the inner join between temp
and temp_truncated
by finding matches in temp
that start with / contain the strings in temp_truncated
Desired Output:
message sentiment cutoff
0 I am feeling very well 1 1
1 It is hard to believe this happened 0 1
2 What is love 1 1
3 Amazing day today 1 1
Upvotes: 4
Views: 101
Reputation: 1887
You can use str.startswith
, or other str.
functions such as str.contains
in an apply to get a matches dataframe:
matches = temp_truncated.message.apply(
lambda x: temp[temp.message.str.startswith(x)]['sentiment']
).dropna(how='all')
This matches
dataframe contains the rows of temp_truncated
which has a match with one of the rows in temp
. These temp
rows are the columns of the matches
dataframe. The values are the sentiment values of these temp
rows.
With this you can filter the temp
dataframe with only matched rows and then enrich it with the corresponding cutoff value from temp_truncated
:
df = temp.iloc[matches.columns]
df.index = matches.index
df = df.merge(temp_truncated['cutoff'], left_index=True, right_index=True)
Result matches your desired output:
message sentiment cutoff
0 I am feeling very well 1 1
1 It is hard to believe this happened 0 1
2 What is love? 1 1
3 Amazing day today 1 1
Upvotes: 0
Reputation: 37737
Here is an approach using rapidfuzz
with pandas.merge
:
#pip install rapidfuzz
from rapidfuzz import process
out = (
temp_truncated
.assign(message_adapted = (temp_truncated['message']
.map(lambda x: process.extractOne(x, temp['message']))).str[0])
.merge(temp, left_on="message_adapted", right_on="message", how="left", suffixes=("_", ""))
.drop(columns=["message_adapted", "message_"])
.loc[:, temp.columns.tolist() + ["cutoff"]]
)
print(out)
message sentiment cutoff
0 I am feeling very well 1 1
1 It is hard to believe this happened 0 1
2 What is love? 1 1
3 Amazing day today 1 1
Upvotes: 1
Reputation: 260290
You can use:
import re
pattern = '|'.join(map(re.escape, temp_truncated['message']))
key = temp['message'].str.extract(f'({pattern})', expand=False)
out = (temp
.merge(temp_truncated.rename(columns={'message': 'sub'}),
left_on=key, right_on='sub')
.drop(columns='sub')
)
Output:
message sentiment cutoff
0 I am feeling very well 1 1
1 It is hard to believe this happened 0 1
2 What is love? 1 1
3 Amazing day today 1 1
Upvotes: 5