AntonioRB

Reputation: 159

Why does argument order matter when comparing text sequences with difflib?

I want to compare the similarity of some texts to detect duplicates, but when I use difflib, it returns different ratios depending on the order in which I pass the data.

A random example is below. Thanks.

import difflib

a = 'josephpFRANCES'
b = 'ABswazdfsadSASAASASASAS'

# Compare a against b
seq = difflib.SequenceMatcher(None, a, b)
d = seq.ratio() * 100
print(d)

# Same comparison with the arguments swapped
seq2 = difflib.SequenceMatcher(None, b, a)
d2 = seq2.ratio() * 100
print(d2)

d = 16.216216216216218

d2 = 10.81081081081081

Upvotes: 0

Views: 204

Answers (1)

blhsing

Reputation: 107075

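SequenceMatcher.ratio() is documented as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. T is therefore the same whichever order you pass a and b, so the divisor is not what changes. What changes is M. The matcher does not search for the best possible alignment. It picks the longest matching block, preferring the earliest one in the first sequence when there is a tie, and then repeats the search to the left and right of that block. Swapping a and b can change which block is picked first, and that can leave fewer characters available to match afterwards. In your example T = 14 + 23 = 37 in both cases: 16.216...% is 2*3/37 (three matched characters) and 10.810...% is 2*2/37 (two matched characters). The difflib documentation warns about this explicitly: the result of a ratio() call may depend on the order of the arguments.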
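To check which part of the calculation changes with argument order, a small sketch along these lines may help. It relies on the documented 2.0*M/T definition of ratio() and reads M from get_matching_blocks(). The helper names match_stats and symmetric_ratio are made up for this illustration, and taking the maximum over both orders is only one possible way to get an order-independent score, not something difflib provides itself.

```python
import difflib

a = 'josephpFRANCES'
b = 'ABswazdfsadSASAASASASAS'


def match_stats(first, second):
    """Return (M, T, ratio) for one argument order (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, first, second)
    # M: total size of the matching blocks the matcher settled on.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences, identical for either order.
    total = len(first) + len(second)
    return matched, total, matcher.ratio()


def symmetric_ratio(x, y):
    """Hypothetical helper: take the better ratio of the two argument orders."""
    forward = difflib.SequenceMatcher(None, x, y).ratio()
    backward = difflib.SequenceMatcher(None, y, x).ratio()
    return max(forward, backward)


print(match_stats(a, b))  # M, T and ratio for (a, b)
print(match_stats(b, a))  # same T, but M may differ
print(symmetric_ratio(a, b) == symmetric_ratio(b, a))  # True by construction
```

For duplicate detection, any consistent rule works as long as both texts are always compared the same way, for example always passing the shorter (or lexicographically smaller) string first instead of running the comparison twice.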

Upvotes: 1
