DarknessPlusPlus
DarknessPlusPlus

Reputation: 563

Inner merge two DataFrames on string partial match

We have the following two data frames

temp = pd.DataFrame(np.array([['I am feeling very well',1],['It is hard to believe this happened',0],
                                  ['What is love?',1], ['No new friends',0],
                             ['I love this show',1],['Amazing day today',1]]),
                                columns = ['message','sentiment'])

temp_truncated = pd.DataFrame(np.array([['I am feeling very',1],['It is hard to believe',1],
                                  ['What is',1], ['Amazing day',1]]),
                                columns = ['message','cutoff'])

My idea is to create a third DataFrame that would represent the inner join between temp and temp_truncated by finding matches in temp that start with / contain the strings in temp_truncated

Desired Output:

     message                             sentiment   cutoff            
0    I am feeling very well               1          1
1    It is hard to believe this happened  0          1
2    What is love                         1          1
3    Amazing day today                    1          1

Upvotes: 4

Views: 101

Answers (3)

Paul
Paul

Reputation: 1887

You can use str.startswith, or other str. functions such as str.contains in an apply to get a matches dataframe:

matches = temp_truncated.message.apply(
    lambda x: temp[temp.message.str.startswith(x)]['sentiment']
).dropna(how='all')

This matches dataframe contains the rows of temp_truncated which has a match with one of the rows in temp. These temp rows are the columns of the matches dataframe. The values are the sentiment values of these temp rows. With this you can filter the temp dataframe with only matched rows and then enrich it with the corresponding cutoff value from temp_truncated:

df = temp.iloc[matches.columns]
df.index = matches.index
df = df.merge(temp_truncated['cutoff'], left_index=True, right_index=True)

Result matches your desired output:

    message                             sentiment   cutoff
0   I am feeling very well              1           1
1   It is hard to believe this happened 0           1
2   What is love?                       1           1
3   Amazing day today                   1           1

Upvotes: 0

Timeless
Timeless

Reputation: 37737

Here is an approach using rapidfuzz with pandas.merge :

#pip install rapidfuzz
from rapidfuzz import process

out = (
        temp_truncated
           .assign(message_adapted = (temp_truncated['message']
                                       .map(lambda x: process.extractOne(x, temp['message']))).str[0])
           .merge(temp, left_on="message_adapted", right_on="message", how="left", suffixes=("_", ""))
           .drop(columns=["message_adapted", "message_"])
           .loc[:, temp.columns.tolist() + ["cutoff"]]
      )

# Output :

print(out)
                               message sentiment cutoff
0               I am feeling very well         1      1
1  It is hard to believe this happened         0      1
2                        What is love?         1      1
3                    Amazing day today         1      1

Upvotes: 1

mozway
mozway

Reputation: 260290

You can use:

import re
pattern = '|'.join(map(re.escape, temp_truncated['message']))

key = temp['message'].str.extract(f'({pattern})', expand=False)

out = (temp
 .merge(temp_truncated.rename(columns={'message': 'sub'}),
        left_on=key, right_on='sub')
 .drop(columns='sub')
)

Output:

                               message sentiment cutoff
0               I am feeling very well         1      1
1  It is hard to believe this happened         0      1
2                        What is love?         1      1
3                    Amazing day today         1      1

Upvotes: 5

Related Questions