Lui Hellesoe
Lui Hellesoe

Reputation: 384

Check string values in one column df to another column df with multiple conditions

I have two data frames:

import pandas as pd

first_df = pd.DataFrame({'Full Name': ['Mulligan Nick & Mary', 'Tsang S C', 'Hattie J A C '],
                         'Address': ['270 Claude Road', '13 Sunnyridge Place', '18A Empire Road']})

second_df = pd.DataFrame({'Owner' : ['David James Mulligan', 'Brenda Joy Mulligan ', 'Helen Kwok Hattie'],
                          'Add Match': ['19 Dexter Avenue', 'Claude Road ', 'Building NO 512']})

Is there anyway to match only the first string in Full Name column to the last string in Owner column.

If there is a match, I then want to compare Address against Add match to see if there are any like values. If the first condition passes but the second condition fails, this would not be added into the new data frame.

Using a left join results in:

new_df = first_df.merge(second_df, how='left', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])
print(new_df.head())

              Full Name              Address Owner Add Match
0  Mulligan Nick & Mary      270 Claude Road   NaN       NaN
1             Tsang S C  13 Sunnyridge Place   NaN       NaN
2         Hattie J A C       18A Empire Road   NaN       NaN

However the output wanted would look more like this:

new_df

Name                 Address
----                 --------
Brenda Joy Mulligan  Claude Road

Upvotes: 0

Views: 107

Answers (2)

xicocaio
xicocaio

Reputation: 887

Inspired by this answer you could employ a similar solution.

TL;DR

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]
df_final = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])
address_matches = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[address_matches].drop(columns=['last_name', 'start_name', 'Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})

Step-by-step

Initially, you extract the last name keys you want.

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', 1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split(' ').str[-1]

PS: Here we are using built-in string methods from pandas/numpy combo given your instructions. But if it fits you better you could also apply the similarity methods (e.g., difflib.get_close_matches) shown down below for the address part.

Next, you perform an inner join of these dataframes to match the last_name key.

df_temp = first_df.merge(second_df, how='inner', left_on=['last_name'], right_on=['last_name'])

Then you apply the difflib.get_close_matches with the desired similarity (I used cutoff=0.8 because above this value there were no values returned) method to mark which rows contain matches and subsequently get only the rows you want.

matches_mask = df_final.apply(lambda x: True if difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8) else False, axis=1)
df_final = df_final[matches_mask].drop(columns=['last_name', 'start_name'])
Full Name               Address             Owner                   Add Match

Mulligan Nick & Mary    270 Claude Road     Brenda Joy Mulligan     Claude Road

Finally, to get match the format of the results posted at the end of your question you drop or rename some columns.

df_final.drop(columns=['Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})
Owner                   Add Match

Brenda Joy Mulligan     Claude Road

Upvotes: 1

Laurent
Laurent

Reputation: 13478

You could take advantage of the difflib module from Python standard library to find similarities between different columns. For instance, you can define the following function:

from difflib import SequenceMatcher

def compare_df(left, right, col: str):
    left[f"{col}_match_ratio"] = 0

    for value in left[col]:
        best_ratio = 0
        for other in right[col]:
            result = SequenceMatcher(None, str(value), str(other)).ratio()
            if result > best_ratio:
                best_ratio = result
        left.loc[left[col] == value, f"{col}_match_ratio"] = round(best_ratio, 2)

Then:

  • you just have to make sure that the column you want to compare on have the same name in both dfs
  • you call df_compare(first_df, second_df, "Owner") which will add "Owner_match_ratio" column to second_df
  • finally, you filter second df on the desired minimum match ratio (70 % for instance) like this: new_df = second_df.loc[second_df["Owner_match_ratio"] > 0.7, :]

Upvotes: 2

Related Questions