Wizuriel
Wizuriel

Reputation: 3747

Pandas fuzzy merge/match name column, with duplicates

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, and if so, copy some of that information into my fundraiser dataset (donor name, email and their first donation). Problems with my data are:

  1. I need to match by name and email, but a user might have slightly different names (ex 'Kat' and 'Kathy').
  2. Duplicate names for donors and fundraisers:
    • 2a) With donors I can get unique name/email combinations since I just care about the first donation date
    • 2b) With fundraisers though I need to keep both rows and not lose data like the date.

Sample code I have right now:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})

donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)

donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).

Ideally I want my fundraisers dataset to look like:

Date                Email       name        Donor Name  Donor Email Donor Date
2013-03-27 10:00:00     [email protected]      John Doe    John Doe    [email protected]      2013-03-27 10:00:00 
2013-01-03 10:39:00     [email protected]      John Doe    John Doe    [email protected]      2013-03-27 10:00:00 
2013-02-03 10:39:00     [email protected]      Kathy test  Kat test    [email protected]      2013-03-27 10:39:00 
2013-03-03 10:39:00     [email protected]    Tes Ester   
2013-04-03 10:39:00     [email protected]  Jane Doe    Jane Doe    [email protected]  2013-04-03 10:39:00

Code:

fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    max = 75
    for funindex in range(len(fundraisers.index)):
        aname = donors["name"][donindex]
        comp = fundraisers["name"][funindex]
        ratio = fuzz.ratio(aname, comp)
        if ratio > max:
            if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
                ratio *= 2
            max = ratio
            fundraisers["donor name"][funindex] = aname
            fundraisers["donor email"][funindex] = donors["Email"][donindex]
            fundraisers["donor date"][funindex] = donors["Date"][donindex]

Upvotes: 6

Views: 9854

Answers (3)

Shulaz Shan
Shulaz Shan

Reputation: 175

How to identify Fuzzy duplication in DataFrame using Pandas

This my data frame

def get_ratio(row):
name = row['Name_1']
return fuzz.token_sort_ratio(name,"Ceylon Hotels Corporation")
df[df.apply(get_ratio, axis=1) > 70]

Upvotes: 1

lostsoul29
lostsoul29

Reputation: 756

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package:

def get_closest_match(x, list_strings):

  best_match = None
  highest_jw = 0

  for current_string in list_strings:
    current_score = jellyfish.jaro_winkler(x, current_string)

    if(current_score > highest_jw):
      highest_jw = current_score
      best_match = current_string

  return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

Output:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e

Update: Use jaro_winkler from the Levenshtein module for improved performance.

from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw

%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Upvotes: 1

roman
roman

Reputation: 117337

Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:

def get_donors(row):
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        v = donors.ix[d.idxmax(), ['name','Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)

Output:

                 Date           Email        name donor name     donor email           donor date
0 2013-03-27 10:00:00          [email protected]    John Doe   John Doe          [email protected]  2013-03-01 10:39:00
1 2013-03-01 10:39:00          [email protected]    John Doe   John Doe          [email protected]  2013-03-01 10:39:00
2 2013-03-02 10:39:00          [email protected]  Kathy test   Kat test          [email protected]  2013-03-27 10:39:00
3 2013-03-03 10:39:00    [email protected]   Tes Ester                                                
4 2013-03-04 10:39:00  [email protected]    Jane Doe   Jane Doe  [email protected]  2013-03-04 10:39:00

Upvotes: 4

Related Questions