How to link two dataframes based on the string similarity of one column

I have two dataframes, both have an ID and a Column Name that contains Strings. They might look like this:

Dataframes:

DF-1                              DF-2
---------------------             ---------------------
     ID          Name                  ID          Name
1    56       aaeessa             1    12    H.P paRt 1 
2    98       1o7v9sM             2    76       aa3esza
3   175    HP. part 1             3   762    stakoverfl 
4     2     stackover             4     2       lo7v9Sm

I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:

Result:

Result
-----------------
     ID1     ID2
1    56       76
2    98        2
3   175       12
4     2      762

This could be easily achieved using a double for loop, but I'm looking for an elegant (and faster way) to accomplish this, maybe lambdas list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score but I can't quite come up with the soltution by myself.

EDIT: The DataFrames are of different lenghts, one of the purposes of this function is to determine which elements of the lesser dataframe appear in the greater dataframe and match those, discarding the rest. So in the resulting table should only appear pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).

Upvotes: 3

Views: 3779

Answers (3)

Erfan
Erfan

Reputation: 42926

Using my fuzzy_wuzzy function from the linked answer:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
      .merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
      .filter(like='ID')

Output

   ID1  ID2
0   56   76
1   98    2
2  175   12
3    2  762

Upvotes: 1

SCool
SCool

Reputation: 3375

Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/

You need to train the classifier with human input and then it will use the learned setting to match the whole dataframe.

first pip install pandas-dedupe and try this:

import pandas as pd
import pandas_dedupe

df1=pd.DataFrame({'ID':[56,98,175],
                 'Name':['aaeessa', '1o7v9sM', 'HP. part 1']})

df2=pd.DataFrame({'ID':[12,76,762,2],
                 'Name':['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})


#initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# reset index
df_final = df_final.reset_index(drop=True)

# print result

print(df_final)

    ID        Name  cluster id  confidence
0   98     1o7v9sm         0.0    1.000000
1    2     lo7v9sm         0.0    1.000000
2  175  hp. part 1         1.0    0.999999
3   12  h.p part 1         1.0    0.999999
4   56     aaeessa         2.0    0.999967
5   76     aa3esza         2.0    0.999967
6  762  stakoverfl         NaN         NaN

you can see matched pairs are assigned a cluster and confidence level. unmatched are nan. you can now analyse this info however you wish. perhaps only take results with a confidence level above 80% for example.

Upvotes: 4

powerPixie
powerPixie

Reputation: 708

I suggest you a library called Python Record Linkage Toolkit.

Once you import the library, you must index the sources you intend to compare, something like this:

 indexer = recordlinkage.Index()

 #using url as intersection
 indexer.block('id')
 candidate_links = indexer.index(df_1, df_2)

 c = recordlinkage.Compare()

Let's say you want to compare based on the similiraties of strings, but they don't match exactly:

 c.string('name', 'name', method='jarowinkler', threshold=0.85)

And if you want an exact match you should use:

c.exact('name')

Upvotes: 1

Related Questions