Reputation: 672
I have two data frames df1 and df2 as shown below:
Df1:
movie correct_id
0 birdman N/A
1 avengers: endgame N/A
2 deadpool N/A
3 once upon deadpool N/A
Df2: data frame of reference
movie correct_id
0 birdmans 4
1 The avengers: endgame 2
2 The King 3
3 once upon a deadpool 1
Expected Result:
movie correct_id
0 birdman 4
1 avengers: endgame 2
2 deadpool N/A
3 once upon deadpool 1
Please how do I merge two dataframes based on partial string match?
NB: The movie's name not exactly the same
Upvotes: 0
Views: 368
Reputation: 120391
From a previous post.
Input data:
>>> df1
movie correct_id
0 birdman NaN
1 avengers: endgame NaN
2 deadpool NaN
3 once upon deadpool NaN
>>> df2
movie correct_id
0 birdmans 4
1 The avengers: endgame 2
2 The King 3
3 once upon a deadpool 1
A bit of fuzzy logic:
from fuzzywuzzy import process
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
.tolist(), columns=["movie", "ratio", "best_id"])
>>> dfm
movie ratio best_id
0 birdmans 93 0
1 The avengers: endgame: endgame 90 1
2 once upon a deadpool 90 3
3 once upon a deadpool 95 3
The index of dfm
is the index of df1
rather than the column best_id
is the index of df2
. Now you can update your first dataframe:
THRESHOLD = 90 # adjust this number
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
df1["correct_id"] = df2.loc[ids, "correct_id"].astype("Int64")
>>> df1
movie correct_id
0 birdman 4
1 avengers: endgame 2
2 deadpool <NA>
3 once upon deadpool 1
Upvotes: 1