Reputation: 384
I have two data frames:
import pandas as pd
first_df = pd.DataFrame({'Full Name': ['Mulligan Nick & Mary', 'Tsang S C', 'Hattie J A C '],
'Address': ['270 Claude Road', '13 Sunnyridge Place', '18A Empire Road']})
second_df = pd.DataFrame({'Owner' : ['David James Mulligan', 'Brenda Joy Mulligan ', 'Helen Kwok Hattie'],
'Add Match': ['19 Dexter Avenue', 'Claude Road ', 'Building NO 512']})
Is there anyway to match only the first string in Full Name
column to the last string in Owner
column.
If there is a match, I then want to compare Address
against Add match
to see if there are any like values. If the first condition passes but the second condition fails, this would not be added into the new data frame.
Using a left join results in:
new_df = first_df.merge(second_df, how='left', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])
print(new_df.head())
Full Name Address Owner Add Match
0 Mulligan Nick & Mary 270 Claude Road NaN NaN
1 Tsang S C 13 Sunnyridge Place NaN NaN
2 Hattie J A C 18A Empire Road NaN NaN
However the output wanted would look more like this:
new_df
Name Address
---- --------
Brenda Joy Mulligan Claude Road
Upvotes: 0
Views: 107
Reputation: 887
Inspired by this answer you could employ a similar solution.
first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]
df_final = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])
address_matches = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[address_matches].drop(columns=['last_name', 'start_name', 'Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})
Initially, you extract the last name keys you want.
first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]
PS: Here we are using built-in string methods from pandas/numpy combo given your instructions. But if it fits you better you could also apply the similarity methods (e.g., difflib.get_close_matches
) shown down below for the address part.
Next, you perform an inner join of these dataframes to match the last_name
key.
df_temp = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])
Then you apply the difflib.get_close_matches
with the desired similarity (I used cutoff=0.8
because above this value there were no values returned) method to mark which rows contain matches and subsequently get only the rows you want.
matches_mask = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[matches_mask].drop(columns=['last_name', 'start_name'])
Full Name Address Owner Add Match
Mulligan Nick & Mary 270 Claude Road Brenda Joy Mulligan Claude Road
Finally, to get match the format of the results posted at the end of your question you drop or rename some columns.
df_final.drop(columns=['Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})
Owner Add Match
Brenda Joy Mulligan Claude Road
Upvotes: 1
Reputation: 13478
You could take advantage of the difflib
module from Python standard library to find similarities between different columns.
For instance, you can define the following function:
from difflib import SequenceMatcher
def compare_df(left, right, col: str):
left[f"{col}_match_ratio"] = 0
for value in left[col]:
best_ratio = 0
for other in right[col]:
result = SequenceMatcher(None, str(value), str(other)).ratio()
if result > best_ratio:
best_ratio = result
left.loc[left[col] == value, f"{col}_match_ratio"] = round(best_ratio, 2)
Then:
new_df = second_df.loc[second_df["Owner_match_ratio"] > 0.7, :]
Upvotes: 2