Measuring similarity between binary lists

Question

I have two binary lists that I'm trying to compare. To compare them, I sum the positions where the corresponding values are equal and convert the result to a percentage:

import numpy as np

l1 = [1,0,1]
l2 = [1,1,1]

print(np.dot(l1 , l2) / len(l1) * 100)

prints 66.666

So in this case l1 and l2 score 66.666 in terms of closeness. The less similar the lists are, the lower the closeness value.

For example, using the values:

l1 = [1,0,1]
l2 = [0,1,0]

returns 0.0

How can I plot l1 and l2 in a way that describes the relationship between them? Is there a name for this method of measuring similarity between binary values?

Using a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

plt.scatter('x', 'y', data=pd.DataFrame({'x': l1, 'y': l2}))

produces :

But this does not make sense?

Update:

"if both entries are 0, this will not contribute to your 'similarity'"

The updated code below computes a similarity measure that also counts corresponding 0 values in the final score.

import numpy as np

l1 = [0,0,0]
l2 = [0,1,0]

print(np.count_nonzero(np.isclose(l1, l2)) / len(l1) * 100)

which returns :

66.66666666666666

Alternatively, the code below uses normalized_mutual_info_score, which returns 1.0 both for lists that are identical and for lists that differ in every position. Does that mean normalized_mutual_info_score is not a suitable similarity measure?

from sklearn.metrics.cluster import normalized_mutual_info_score

l1 = [1,0,1]
l2 = [0,1,0]

print(normalized_mutual_info_score(l1 , l2))

l1 = [0,0,0]
l2 = [0,0,0]

print(normalized_mutual_info_score(l1 , l2))

prints :

1.0
1.0
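
A likely explanation, offered as a sketch rather than a statement about scikit-learn internals: normalized_mutual_info_score is a clustering metric, so it compares how the two label lists group the positions, not the label values themselves. [1,0,1] and [0,1,0] both put positions 0 and 2 in one group and position 1 in another, so they describe the same grouping with the labels swapped. The helper partition below is hypothetical and only illustrates that point in plain Python:

```python
def partition(labels):
    """Group positions by label and forget the label values (hypothetical helper)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    # A set of frozensets ignores both the label names and the group order.
    return {frozenset(members) for members in groups.values()}

# Same grouping of positions, only the label names are swapped.
print(partition([1, 0, 1]) == partition([0, 1, 0]))

# Different grouping: the second list puts every position in one group.
print(partition([1, 0, 1]) == partition([1, 1, 1]))
```

Under that reading, a score of 1.0 for completely "opposite" binary lists is expected behaviour for a clustering metric, which is why it is a poor fit for position-by-position similarity.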

blue note · Accepted Answer

No, the plot does not make sense. What you are doing is essentially an inner product between vectors. Under this metric, l1 and l2 are treated as vectors in a 3D (in this case) space, and the result measures whether they point in the same or a similar direction and have similar length. The output is a scalar value, so there's nothing to plot.
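
To make the inner-product point concrete, here is a minimal sketch in plain Python. The helper name dot_score is an assumption for illustration and does not come from the question. For 0/1 vectors each product a * b is 1 only when both entries are 1, so the score counts shared 1s and ignores shared 0s:

```python
def dot_score(u, v):
    """Percentage of positions where both binary lists hold a 1 (hypothetical helper)."""
    if len(u) != len(v) or not u:
        raise ValueError("lists must be non-empty and of equal length")
    # For 0/1 entries, a * b is 1 only when a == b == 1.
    shared_ones = sum(a * b for a, b in zip(u, v))
    return shared_ones / len(u) * 100

# Two shared 1s out of three positions.
print(dot_score([1, 0, 1], [1, 1, 1]))

# Identical lists, yet the score is zero because there are no shared 1s.
print(dot_score([0, 0, 0], [0, 0, 0]))
```

The second call shows why the score is not a general "closeness" measure: two identical all-zero lists score the same as two lists that disagree everywhere.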

If you want to show the individual contribution of each component, you could do something like

# True where the two lists agree at a position, False otherwise
contributions = [a == b for a, b in zip(l1, l2)]
plt.plot(list(range(len(contributions))), contributions)

but I'm still not sure that this makes sense.
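
If the goal is a single similarity number rather than a plot, two standard names may be what the question is after. The fraction of positions where the lists agree, counting shared 0s, is usually called the simple matching coefficient. The fraction of shared 1s among positions where at least one list has a 1 is the Jaccard index. A minimal sketch in plain Python follows; the function names are hypothetical, and returning 1.0 from jaccard for two all-zero lists is an assumed convention, not a fixed rule:

```python
def simple_matching(u, v):
    """Fraction of positions where the two binary lists agree (0-0 and 1-1)."""
    if len(u) != len(v) or not u:
        raise ValueError("lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(u, v))
    return matches / len(u)

def jaccard(u, v):
    """Shared 1s divided by positions where at least one list has a 1."""
    if len(u) != len(v):
        raise ValueError("lists must have the same length")
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    # Assumed convention: two all-zero lists are treated as identical.
    return both / either if either else 1.0

print(simple_matching([0, 0, 0], [0, 1, 0]))  # shared 0s count as matches
print(jaccard([1, 0, 1], [1, 1, 1]))          # shared 0s are ignored
```

Which one fits depends on whether a shared 0 should count as evidence of similarity: simple matching says yes, Jaccard says no.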
