Why does argument order matter when comparing text sequences with difflib?

Question

I want to compare the similarity between some texts to detect duplicates, but when I use difflib, it returns different ratios depending on the order in which I pass the data.

A random example:

Thanks

import difflib

a = 'josephpFRANCES'
b = 'ABswazdfsadSASAASASASAS'

# Compare a against b
seq = difflib.SequenceMatcher(None, a, b)
d = seq.ratio() * 100
print(d)

# Same two strings, arguments swapped
seq2 = difflib.SequenceMatcher(None, b, a)
d2 = seq2.ratio() * 100
print(d2)

d = 16.216216216216218

d2 = 10.81081081081081

blhsing · Accepted Answer

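The divisor is not what changes here. ratio() returns 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. T is 37 for your strings whichever one comes first, so the difference has to come from M.

SequenceMatcher finds matches by taking the longest contiguous matching block, then recursing on the pieces to the left and right of it. When several blocks tie for longest, it picks the one that starts earliest in the first sequence, and that choice can cut off matches the other order would have found. The difflib documentation cautions that the result of ratio() may depend on the order of the arguments. In your example, comparing a to b finds 3 matching characters (2*3/37, about 16.2%), while comparing b to a finds only 2 (2*2/37, about 10.8%). If you need an order-independent score for duplicate detection, compute the ratio both ways and take the maximum, or always pass each pair in a fixed order.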
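To see where the two numbers come from, it may help to print the match count M and the combined length T for both argument orders, using the strings from the question. This is only a sketch: `matched` and `symmetric_ratio` are hypothetical helper names, not part of difflib, and taking the maximum of the two orders is one possible way to get an order-independent score, not the only one.

```python
import difflib

a = 'josephpFRANCES'
b = 'ABswazdfsadSASAASASASAS'


def matched(x, y):
    """Return (M, T): matched element count and combined length of x and y."""
    sm = difflib.SequenceMatcher(None, x, y)
    # get_matching_blocks() ends with a zero-length sentinel block,
    # which contributes nothing to the sum.
    m = sum(block.size for block in sm.get_matching_blocks())
    return m, len(x) + len(y)


def symmetric_ratio(x, y):
    """Hypothetical helper: the larger of the two order-dependent ratios."""
    forward = difflib.SequenceMatcher(None, x, y).ratio()
    backward = difflib.SequenceMatcher(None, y, x).ratio()
    return max(forward, backward)


print(matched(a, b))  # (3, 37)
print(matched(b, a))  # (2, 37)

# Same value regardless of which string is passed first.
print(symmetric_ratio(a, b) == symmetric_ratio(b, a))  # True
```

T is 37 in both calls, so only M differs between the two orders. Running both comparisons doubles the work per pair, which may matter when scanning a large collection for duplicates.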
