RanRag

Reputation: 49597

How does SequenceMatcher.ratio work in difflib?

I was trying out Python's difflib module and came across SequenceMatcher. I tried the following examples but couldn't understand the results.

>>> SequenceMatcher(None,"abc","a").ratio()
0.5

>>> SequenceMatcher(None,"aabc","a").ratio()
0.4

>>> SequenceMatcher(None,"aabc","aa").ratio()
0.6666666666666666

Now, according to the documentation for ratio():

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.

so, for my cases:

  1. T=4 and M=1 so ratio 2*1/4 = 0.5
  2. T=5 and M=2 so ratio 2*2/5 = 0.8
  3. T=6 and M=1 so ratio 2*1/6.0 = 0.33

According to my understanding, T = len("aabc") + len("a") and M = 2, because "a" occurs twice in "aabc".

So where am I going wrong? What am I missing?

Here is the source code of SequenceMatcher.ratio()

Upvotes: 8

Views: 14722

Answers (2)

ARTHUR SIQUEIRA

Reputation: 1

never too late...

from difflib import SequenceMatcher

texto1 = 'BRASILIA~DISTRITO FEDERAL, DF'
texto2 = 'BRASILIA-DISTRITO FEDERAL, '

tamanho_texto1 = len(texto1)
tamanho_texto2 = len(texto2)
tamanho_tot = tamanho_texto1 + tamanho_texto2

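# Count the characters of the shorter string that occur anywhere in the other.
# Note: this is a membership test, not SequenceMatcher's block matching, so
# the two results agree for this input but not in general.
tot = 0
if len(texto1) <= len(texto2):
    for y in texto1:
        if y in texto2:
            tot += 1
else:
    for y in texto2:
        if y in texto1:
            tot += 1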
            
print('sequenceM = ',SequenceMatcher(None, texto1, texto2).ratio())
print('Total calculado = ',2*tot/tamanho_tot)

sequenceM = 0.9285714285714286

Total calculado = 0.9285714285714286
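One caveat: the loop above counts characters by membership, while SequenceMatcher counts characters that fall inside ordered matching blocks. The two agree for the strings above, but they can diverge. A minimal sketch follows. The helper name membership_ratio is hypothetical, and it assumes the intent is to iterate over the shorter string.

```python
from difflib import SequenceMatcher


def membership_ratio(a, b):
    """Hypothetical helper: ratio based on a plain character-membership count."""
    # Iterate over the shorter string and count characters found in the other.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    tot = sum(1 for ch in shorter if ch in longer)
    return 2 * tot / (len(a) + len(b))


a, b = "abc", "cba"
# Every character of "abc" occurs somewhere in "cba".
print("membership      =", membership_ratio(a, b))
# SequenceMatcher can only match one character here, because order matters.
print("SequenceMatcher =", SequenceMatcher(None, a, b).ratio())
```

For reversed strings like these, the membership count reports a perfect score while SequenceMatcher does not, so the loop is best read as an approximation that happens to work for near-identical inputs.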

Upvotes: 0

Fred Foo

Reputation: 363817

You've got the first case right. In the second case, only one a from aabc matches, so M = 1 and the ratio is 2*1/5 = 0.4. In the third example, both a's match, so M = 2 and the ratio is 2*2/6 = 0.67.
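If you want to check M yourself, one way is to sum the sizes of the blocks returned by get_matching_blocks(). This is only a sketch. The helper name match_count is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher


def match_count(a, b):
    """Sum the sizes of the matching blocks; this is M in 2.0*M / T."""
    sm = SequenceMatcher(None, a, b)
    # The final block is a zero-size sentinel, so it does not affect the sum.
    return sum(block.size for block in sm.get_matching_blocks())


for a, b in [("abc", "a"), ("aabc", "a"), ("aabc", "aa")]:
    m = match_count(a, b)
    t = len(a) + len(b)
    # Compare the hand-computed ratio with what ratio() reports.
    print(a, b, "M =", m, "T =", t,
          "2.0*M/T =", 2.0 * m / t,
          "ratio() =", SequenceMatcher(None, a, b).ratio())
```

For the three examples in the question this gives M = 1, 1 and 2, which matches the ratios of 0.5, 0.4 and 0.666... shown there.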

[P.S.: you're referring to the ancient Python 2.4 source code. The current source code is at hg.python.org.]

Upvotes: 6
