David 54321
David 54321

Reputation: 728

Create a Fuzzy Duplicate Key to Sum Rows with Fuzzy Matches (Pandas)

So I have a table where I have identified fuzzy matches and an amount. I want to be able to summarize the amount by this common key.

My Data looks like this:

Name Match1 Match2 Amount
Jame James Jim 1
Jame James Jim 2
Mike Mikes Miike 3
James Jame Jim 4
Jim Jame James 5
Mikes Mike Miike 6
Miike Mike Mikes 7
data = {'Name': {0: 'Jame',
  1: 'Jame',
  2: 'Mike',
  3: 'James',
  4: 'Jim',
  5: 'Mikes',
  6: 'Miike'},
 'Match1': {0: 'James',
  1: 'James',
  2: 'Mikes',
  3: 'Jame',
  4: 'Jame',
  5: 'Mike',
  6: 'Mike'},
 'Match2': {0: 'Jim',
  1: 'Jim',
  2: 'Miike',
  3: 'Jim',
  4: 'James',
  5: 'Miike',
  6: 'Mikes'},
 'Amount': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}}

df = pd.DataFrame.from_dict(data)

What I'd like my resulting table to look like:

Key Name Match1 Match2 Amount
Jame Jame James Jim 1
Jame Jame James Jim 2
Mike Mike Mikes Miike 3
Jame James Jame Jim 4
Jame Jim Jame James 5
Mike Mikes Mike Miike 6
Mike Miike Mike Mikes 7

So I can summarize my data like this:

Name Amount
Jame 12
Mike 16

Upvotes: 3

Views: 139

Answers (1)

user16836078
user16836078

Reputation:

If you initially have the list of keys for your dataframe, you can do this.

mylist = ['Jame','Mike']

df['Key'] = [j for i in df.to_numpy() for j in mylist if j in i]

df.set_index(df.columns[-1]).reset_index(inplace=True)

    Key   Name Match1 Match2  Amount
0  Jame   Jame  James    Jim       1
1  Jame   Jame  James    Jim       2
2  Mike   Mike  Mikes  Miike       3
3  Jame  James   Jame    Jim       4
4  Jame    Jim   Jame  James       5
5  Mike  Mikes   Mike  Miike       6
6  Mike  Miike   Mike  Mikes       7

And then, you can do groupby.sum to get your desired output.

df.groupby('Key')['Amount'].sum().reset_index()
    Key  Amount
0  Jame      12
1  Mike      16

Update

To obtain the key through the dataframe, you can use the process.extractOne in fuzzywuzzy to compare the matching percentage of every element in a row. If it's less than the threshold, we will update the key.

from fuzzywuzzy import process

key = [df.Name[0]]

for row in df.to_numpy():
    for i in row:
        if 50 < process.extractOne(i, key)[1]:
            break
        else:
            key.append(row[0])
            break
            

key
Out[65]: ['Jame', 'Mike']

Upvotes: 1

Related Questions