Reputation: 21
I tried difflib's get_close_matches(), but I want more fine-grained control, so I set out to do something similar myself. I looked at the source for get_close_matches(): it computes ratio() for each candidate and returns the best match (or the best n).
However, when I call ratio() on each candidate myself, the correct match (the one get_close_matches() picks) doesn't have the highest ratio.
I'm not sure if I'm using it incorrectly or what's going on.
Here is a test program showing what's going on:
import difflib
test_string = "ifthecalleridentifieshimselforherselfasinventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthathisorherassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthatheorshewillbecalledback"
test_list = ["ifthecalleridentifiestheirselfasaninventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthattheirassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthattheywillbecalledback",\
"2ifthecalleridentifiedtheirselfasaninventorapplicantoranauthorizedrepresentativeoftheassigneeofrecordpatentdataportalshouldbeusedtoverifythecorrespondenceaddressofrecord"]
print("Ratio of string to first list element:", difflib.SequenceMatcher(None, test_string, test_list[0]).ratio())
print("Ratio of string to second list element:", difflib.SequenceMatcher(None, test_string, test_list[1]).ratio())
print(difflib.get_close_matches(test_string, test_list, n=1, cutoff=0.1))
And here are the results I'm getting:
Ratio of string to first list element: 0.4924114671163575
Ratio of string to second list element: 0.5520169851380042
['ifthecalleridentifiestheirselfasaninventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthattheirassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthattheywillbecalledback']
The ratio for the second element is higher, but get_close_matches() correctly returns the first element.
Upvotes: 0
Views: 36
Reputation: 1
get_close_matches() seems to be using SequenceMatcher.quick_ratio() (excerpt from the difflib source):
s = SequenceMatcher()
s.set_seq2(word)
for x in possibilities:
    s.set_seq1(x)
    if s.real_quick_ratio() >= cutoff and \
       s.quick_ratio() >= cutoff and \
       s.ratio() >= cutoff:
        result.append((s.ratio(), x))
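For what it's worth, the difflib docs describe quick_ratio() and real_quick_ratio() as upper bounds on ratio(), which would make them cheap pre-checks in the excerpt above, with the score that gets stored being s.ratio(). Here is a small sketch that checks that ordering. The strings and the bounds() helper are made up for illustration and are not from the question:

```python
import difflib

# Made-up strings for illustration only (not from the question).
word = "correspondence address"
possibilities = ["correspondance adress", "patent data portal", "address of record"]

def bounds(word, candidate):
    # Hypothetical helper: returns the three scores used in the excerpt,
    # from the cheapest/loosest bound to the exact ratio.
    s = difflib.SequenceMatcher()
    s.set_seq2(word)
    s.set_seq1(candidate)
    return s.real_quick_ratio(), s.quick_ratio(), s.ratio()

for x in possibilities:
    loosest, looser, exact = bounds(word, x)
    print(x, loosest, looser, exact)
```

If that reading is right, a candidate can have a high quick_ratio() and still end up with a much lower ratio().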
When I switch your code to quick_ratio(), I get the following:
import difflib
test_string = "ifthecalleridentifieshimselforherselfasinventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthathisorherassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthatheorshewillbecalledback"
test_list = ["ifthecalleridentifiestheirselfasaninventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthattheirassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthattheywillbecalledback",\
"2ifthecalleridentifiedtheirselfasaninventorapplicantoranauthorizedrepresentativeoftheassigneeofrecordpatentdataportalshouldbeusedtoverifythecorrespondenceaddressofrecord"]
print("Ratio of string to first item:", difflib.SequenceMatcher(None, test_string, test_list[0]).quick_ratio())
print("Ratio of string to second item:", difflib.SequenceMatcher(None, test_string, test_list[1]).quick_ratio())
print(difflib.get_close_matches(test_string, test_list, n=1, cutoff=0.1))
Ratio of string to first item: 0.9612141652613828
Ratio of string to second item: 0.7091295116772823
['ifthecalleridentifiestheirselfasaninventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthattheirassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthattheywillbecalledback']
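One more thing that may be worth checking: in the excerpt above, the query goes to set_seq2() and each candidate goes to set_seq1(), which is the opposite argument order from the test program. The difflib docs caution that the result of ratio() may depend on the order of the arguments. Below is a rough sketch that scores the question's strings in both orders. score_like_excerpt() and best_match() are hypothetical helpers I wrote to mirror the excerpt; they are not part of difflib:

```python
import difflib

# Strings copied from the question.
test_string = "ifthecalleridentifieshimselforherselfasinventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthathisorherassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthatheorshewillbecalledback"
test_list = [
    "ifthecalleridentifiestheirselfasaninventoranapplicantoranauthorizedrepresentativeoftheassigneeofrecordaskforthecorrespondenceaddressofrecordandinformcallerthattheirassociationwiththeapplicationmustbeverifiedbeforeanyinformationconcerningtheapplicationcanbereleasedandthattheywillbecalledback",
    "2ifthecalleridentifiedtheirselfasaninventorapplicantoranauthorizedrepresentativeoftheassigneeofrecordpatentdataportalshouldbeusedtoverifythecorrespondenceaddressofrecord",
]

def score_like_excerpt(word, candidate):
    # Hypothetical helper: same argument order as the excerpt,
    # candidate as seq1 and the query word as seq2.
    s = difflib.SequenceMatcher()
    s.set_seq2(word)
    s.set_seq1(candidate)
    return s.ratio()

def best_match(word, possibilities):
    # Rank by (score, candidate) and keep the top candidate.
    return max((score_like_excerpt(word, x), x) for x in possibilities)[1]

for x in test_list:
    as_in_question = difflib.SequenceMatcher(None, test_string, x).ratio()
    as_in_excerpt = score_like_excerpt(test_string, x)
    print("question order:", as_in_question, "excerpt order:", as_in_excerpt)

# Compare the sketch's pick with what get_close_matches() returns.
print(best_match(test_string, test_list) ==
      difflib.get_close_matches(test_string, test_list, n=1, cutoff=0.0)[0])
```

If the two orders give different numbers, one possible cause is the autojunk heuristic, which according to the docs only kicks in when the second sequence is at least 200 items long. I have not verified that this is what happens for these strings; passing autojunk=False to SequenceMatcher would be one way to check.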
Upvotes: 0