HelloToEarth

Reputation: 2127

Classifier for matching two sets with similar ID strings in Python

I have two data sets that share a common set of features but label their IDs differently.

I want to know whether there is a classifier that can help me choose the best ID matches based on these features.

Set 1 looks like:

Name         ID1           code1          move1        year
Highland     1             nc             st           2002
Highland     4             nc             st           2001
Highland     gt3           nc             st           2002
Highland     gt2           nc             st           2003
Mark         wt1           ns             st           2000
Mark         ws1           ns             st           1945
Mark         ost6          nc             ct           2002
Niko         1             ng             ct           2000
.
.

Set 2 looks like:

Name         ID2           code2          move2        year
Highland     gt1           nc             st           2002
Highland     gt3           nc             st           
Highland     2             nc             st           2003
Highland     gt4           nc             st           2001
Mark         t1            ns             st           2000
Mark         s1            nsi            st           
Mark         ost6          nci            ct           2002
Niko         1             ngi            ct           2000
.
.

As you can see, the two sets differ in places, but Name is always the same. The IDs sometimes match exactly and sometimes only approximately. The codes and moves either match or are close, and in some rows the year is missing from one set.

I've calculated fuzzy ratios based on Levenshtein distance for these IDs, but they aren't enough on their own to make a reliable match.

Is there a way to better identify matching IDs using something like an SVM?

Upvotes: 4

Views: 518

Answers (1)

farshad1123

Reputation: 325

Try fuzz.token_set_ratio() instead of fuzz.ratio(). It compares the sets of tokens in each string rather than the raw character sequences, so it should give you better matches.

For more information visit the docs.
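
Since the IDs in the question are single tokens, a token-based scorer alone may not separate close candidates, so it may help to combine the ID similarity with the other columns. Below is a minimal sketch of that idea, not a definitive implementation. It assumes rows are only compared when Name is equal, uses difflib.SequenceMatcher from the standard library as a stand-in for a fuzzy ratio (swapping in fuzz.token_set_ratio scaled to 0..1 should be a small change), and uses made-up weights. The names sim, score and best_match are hypothetical helpers.

```python
from difflib import SequenceMatcher

# Rows copied from the question: (Name, ID, code, move, year).
# None marks a year that is missing in the source table.
set1 = [
    ("Highland", "1", "nc", "st", 2002),
    ("Highland", "4", "nc", "st", 2001),
    ("Highland", "gt3", "nc", "st", 2002),
    ("Highland", "gt2", "nc", "st", 2003),
    ("Mark", "wt1", "ns", "st", 2000),
    ("Mark", "ws1", "ns", "st", 1945),
    ("Mark", "ost6", "nc", "ct", 2002),
    ("Niko", "1", "ng", "ct", 2000),
]
set2 = [
    ("Highland", "gt1", "nc", "st", 2002),
    ("Highland", "gt3", "nc", "st", None),
    ("Highland", "2", "nc", "st", 2003),
    ("Highland", "gt4", "nc", "st", 2001),
    ("Mark", "t1", "ns", "st", 2000),
    ("Mark", "s1", "nsi", "st", None),
    ("Mark", "ost6", "nci", "ct", 2002),
    ("Niko", "1", "ngi", "ct", 2000),
]


def sim(a, b):
    """String similarity in [0, 1]; a stand-in for a fuzzy ratio."""
    return SequenceMatcher(None, a, b).ratio()


def score(r1, r2, weights=(0.5, 0.2, 0.1, 0.2)):
    """Weighted similarity of two rows. The weights are illustrative
    assumptions, not tuned values."""
    w_id, w_code, w_move, w_year = weights
    if r1[4] is None or r2[4] is None:
        year = 0.5  # missing year: neither evidence for nor against
    else:
        year = 1.0 if r1[4] == r2[4] else 0.0
    return (w_id * sim(r1[1], r2[1])
            + w_code * sim(r1[2], r2[2])
            + w_move * sim(r1[3], r2[3])
            + w_year * year)


def best_match(row, candidates):
    """Return the highest-scoring candidate with the same Name, or None."""
    same_name = [c for c in candidates if c[0] == row[0]]
    if not same_name:
        return None
    return max(same_name, key=lambda c: score(row, c))


if __name__ == "__main__":
    for row in set1:
        match = best_match(row, set2)
        print(row[0], row[1], "->", match[1] if match else None)
```

If you can hand-label some pairs as match or non-match, the four per-column similarities computed inside score could instead be used as a feature vector for a classifier such as an SVM, which would learn the weights rather than relying on guesses.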

Upvotes: 1
