Reputation: 2127
I have two sets of data that share the same features but label their IDs differently.
I want to know whether a classifier can help me choose the best ID matches between the sets based on these features.
Set 1 looks like:
Name ID1 code1 move1 year
Highland 1 nc st 2002
Highland 4 nc st 2001
Highland gt3 nc st 2002
Highland gt2 nc st 2003
Mark wt1 ns st 2000
Mark ws1 ns st 1945
Mark ost6 nc ct 2002
Niko 1 ng ct 2000
.
.
Set 2 looks like:
Name ID2 code2 move2 year
Highland gt1 nc st 2002
Highland gt3 nc st
Highland 2 nc st 2003
Highland gt4 nc st 2001
Mark t1 ns st 2000
Mark s1 nsi st
Mark ost6 nci ct 2002
Niko 1 ngi ct 2000
.
.
As you can see, there are some differences between the two sets, but Name is always the same. The IDs sometimes match exactly and sometimes only almost match. The codes or moves sometimes match and are sometimes merely close, and the year is sometimes missing entirely in one set.
I've calculated fuzzy ratios, which use Levenshtein distance, for these IDs, but on their own they aren't enough to make a good match.
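To make the setup concrete, here is a rough sketch of the kind of per-field comparison described above, extended from the IDs to the other columns. It uses the standard library's difflib.SequenceMatcher as a stand-in for a Levenshtein-based ratio (the two scores are similar in spirit but not identical), and the dictionary keys, the pair_features helper, and the missing-year flag are illustrative assumptions rather than part of the real data.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Similarity in [0, 1]; a difflib stand-in, not a true Levenshtein ratio."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def pair_features(r1, r2):
    """Build one feature vector for a candidate pair of rows (hypothetical helper)."""
    year_known = r1["year"] is not None and r2["year"] is not None
    return [
        sim(r1["id"], r2["id"]),      # ID similarity
        sim(r1["code"], r2["code"]),  # code similarity
        sim(r1["move"], r2["move"]),  # move similarity
        1.0 if year_known and r1["year"] == r2["year"] else 0.0,  # years agree
        0.0 if year_known else 1.0,   # flag: year missing in at least one row
    ]


# A few rows taken from the two sets shown above; None marks a missing year.
set1 = [
    {"name": "Highland", "id": "gt3", "code": "nc", "move": "st", "year": 2002},
    {"name": "Mark", "id": "ost6", "code": "nc", "move": "ct", "year": 2002},
]
set2 = [
    {"name": "Highland", "id": "gt3", "code": "nc", "move": "st", "year": None},
    {"name": "Highland", "id": "gt1", "code": "nc", "move": "st", "year": 2002},
    {"name": "Mark", "id": "ost6", "code": "nci", "move": "ct", "year": 2002},
]

# Name always matches, so only rows with the same Name are compared.
candidates = [
    (r1, r2, pair_features(r1, r2))
    for r1 in set1
    for r2 in set2
    if r1["name"] == r2["name"]
]

for r1, r2, feats in candidates:
    print(r1["name"], r1["id"], r2["id"], [round(f, 2) for f in feats])
```

If a handful of pairs were labeled by hand as match or non-match, vectors like these could presumably serve as training input for a classifier such as an SVM; whether that beats a simple threshold on the ID ratio is exactly what is unclear.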
Is there a way I can better identify these IDs using something like SVM?
Upvotes: 4
Views: 518
Reputation: 325
Try fuzz.token_set_ratio() instead of fuzz.ratio().
fuzz.token_set_ratio() should give you better matches.
For more information, see the docs.
Upvotes: 1