Check string values in one column df to another column df with multiple conditions

Question

I have two data frames:

import pandas as pd

first_df = pd.DataFrame({'Full Name': ['Mulligan Nick & Mary', 'Tsang S C', 'Hattie J A C '],
                         'Address': ['270 Claude Road', '13 Sunnyridge Place', '18A Empire Road']})

second_df = pd.DataFrame({'Owner' : ['David James Mulligan', 'Brenda Joy Mulligan ', 'Helen Kwok Hattie'],
                          'Add Match': ['19 Dexter Avenue', 'Claude Road ', 'Building NO 512']})

Is there anyway to match only the first string in Full Name column to the last string in Owner column.

If there is a match, I then want to compare Address against Add match to see if there are any like values. If the first condition passes but the second condition fails, this would not be added into the new data frame.

Using a left join results in:

new_df = first_df.merge(second_df, how='left', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])
print(new_df.head())

              Full Name              Address Owner Add Match
0  Mulligan Nick & Mary      270 Claude Road   NaN       NaN
1             Tsang S C  13 Sunnyridge Place   NaN       NaN
2         Hattie J A C       18A Empire Road   NaN       NaN

However the output wanted would look more like this:

new_df

Name                 Address
----                 --------
Brenda Joy Mulligan  Claude Road

xicocaio · Accepted Answer

Inspired by this answer you could employ a similar solution.

TL;DR

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]
df_final = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])
address_matches = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[address_matches].drop(columns=['last_name', 'start_name', 'Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})

Step-by-step

Initially, you extract the last name keys you want.

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]

PS: Here we are using built-in string methods from pandas/numpy combo given your instructions. But if it fits you better you could also apply the similarity methods (e.g., difflib.get_close_matches) shown down below for the address part.

Next, you perform an inner join of these dataframes to match the last_name key.

df_temp = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])

Then you apply the difflib.get_close_matches with the desired similarity (I used cutoff=0.8 because above this value there were no values returned) method to mark which rows contain matches and subsequently get only the rows you want.

matches_mask = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[matches_mask].drop(columns=['last_name', 'start_name'])

Full Name               Address             Owner                   Add Match

Mulligan Nick & Mary    270 Claude Road     Brenda Joy Mulligan     Claude Road

Finally, to get match the format of the results posted at the end of your question you drop or rename some columns.

df_final.drop(columns=['Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})

Owner                   Add Match

Brenda Joy Mulligan     Claude Road

Check string values in one column df to another column df with multiple conditions

Answers (2)

TL;DR

Step-by-step

Related Questions