Reputation: 49597
I was trying out Python's difflib module and came across SequenceMatcher. So I tried the following examples, but couldn't understand what was happening.
>>> SequenceMatcher(None,"abc","a").ratio()
0.5
>>> SequenceMatcher(None,"aabc","a").ratio()
0.4
>>> SequenceMatcher(None,"aabc","aa").ratio()
0.6666666666666666
Now, according to the documentation of ratio():
Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.
So, for my cases:
T=4 and M=1, so ratio 2*1/4 = 0.5
T=5 and M=2, so ratio 2*2/5 = 0.8
T=6 and M=1, so ratio 2*1/6.0 = 0.33
According to my understanding, T = len(aabc) + len(a) and M=2, because a comes twice in aabc.
So where am I going wrong? What am I missing?
Here is the source code of SequenceMatcher.ratio()
Upvotes: 8
Views: 14722
Reputation: 1
never too late...
from difflib import SequenceMatcher

texto1 = 'BRASILIA~DISTRITO FEDERAL, DF'
texto2 = 'BRASILIA-DISTRITO FEDERAL, '
tamanho_texto1 = len(texto1)
tamanho_texto2 = len(texto2)
tamanho_tot = tamanho_texto1 + tamanho_texto2
tot = 0
# Walk the shorter string and count its characters that also occur in the other one.
if tamanho_texto1 <= tamanho_texto2:
    for y in texto1:
        if y in texto2:
            tot += 1
else:
    for y in texto2:
        if y in texto1:
            tot += 1
# Compare difflib's ratio with 2*M/T computed from the manual count.
print('sequenceM = ', SequenceMatcher(None, texto1, texto2).ratio())
print('Total calculado = ', 2*tot/tamanho_tot)
sequenceM = 0.9285714285714286
Total calculado = 0.9285714285714286
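One caveat, as a hedged sketch: counting characters by membership ignores order, so it only agrees with ratio() when the shared characters also line up as in-order matching blocks, as they do for the two strings above. The helper name membership_ratio below is hypothetical and only restates the counting loop; the reversed-string input is an illustrative assumption.

```python
from difflib import SequenceMatcher

def membership_ratio(a, b):
    # Hypothetical helper: count characters of the shorter string that
    # appear anywhere in the longer one, then apply 2*M/T.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    hits = sum(1 for ch in shorter if ch in longer)
    return 2 * hits / (len(a) + len(b))

# Same characters, reversed order: every character is "found",
# but SequenceMatcher can only align one of them in order.
print(membership_ratio("abc", "cba"))               # 1.0
print(SequenceMatcher(None, "abc", "cba").ratio())  # about 0.33
```

So the two numbers can diverge whenever characters repeat or appear in a different order.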
Upvotes: 0
Reputation: 363817
You've got the first case right. In the second case, only one a from aabc matches, so M = 1. In the third example, both a's match, so M = 2.
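To check the M values yourself, here is a minimal sketch that sums the sizes of the blocks returned by get_matching_blocks() and plugs them into 2.0*M / T. The helper name match_count is made up for illustration; the three input pairs are the ones from the question.

```python
from difflib import SequenceMatcher

def match_count(a, b):
    # M: total size of the in-order matching blocks found by SequenceMatcher.
    # The final block is a zero-size sentinel, so it adds nothing to the sum.
    sm = SequenceMatcher(None, a, b)
    return sum(block.size for block in sm.get_matching_blocks())

for a, b in [("abc", "a"), ("aabc", "a"), ("aabc", "aa")]:
    m = match_count(a, b)
    t = len(a) + len(b)
    # Print M, T, the hand-computed ratio and difflib's own ratio side by side.
    print(a, b, "M =", m, "T =", t, 2.0 * m / t, SequenceMatcher(None, a, b).ratio())
```

This gives M = 1, 1, 2 and T = 4, 5, 6 for the three pairs, which reproduces 0.5, 0.4 and 0.666... from the question.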
[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]
Upvotes: 6