blue-sky

Reputation: 53806

Measuring similarity between binary lists

I have two binary lists that I'm trying to compare. To compare them, I sum the positions where the corresponding values are equal and convert this to a percentage:

import numpy as np

l1 = [1,0,1]
l2 = [1,1,1]

print(np.dot(l1 , l2) / len(l1) * 100)

prints 66.666

So in this case l1 and l2 are 66.666% similar. The less similar the lists are, the lower the closeness value.

For example, using the values:

l1 = [1,0,1]
l2 = [0,1,0]

returns 0.0

How can I plot l1 and l2 in a way that describes the relationship between them? Is there a name for this method of measuring similarity between binary values?

Using a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

plt.scatter('x', 'y', data=pd.DataFrame({'x': l1, 'y': l2}))

produces:

[image: scatter plot of l1 against l2]

But this does not make sense?

Update:

"if both entries are 0, this will not contribute to your 'similarity'"

The updated code below computes a similarity measure that also counts corresponding 0 values toward the final score.

import numpy as np

l1 = [0,0,0]
l2 = [0,1,0]

print(len([a for a in np.isclose(l1 , l2) if(a)]) / len(l1) * 100)

which returns:

66.66666666666666

Alternatively, the code below uses normalized_mutual_info_score, which returns 1.0 both for lists that are identical and for lists that differ in every position. Does this mean normalized_mutual_info_score is not a suitable similarity measure?

from sklearn.metrics.cluster import normalized_mutual_info_score

l1 = [1,0,1]
l2 = [0,1,0]

print(normalized_mutual_info_score(l1 , l2))

l1 = [0,0,0]
l2 = [0,0,0]

print(normalized_mutual_info_score(l1 , l2))

prints:

1.0
1.0

Upvotes: 2

Views: 1263

Answers (2)

kuzand

Reputation: 9806

import numpy as np
import matplotlib.pyplot as plt

def unpackbits(a, n):
    ''' Unpacks an integer `a` to n-length binary list. ''' 
    return [a >> i & 1 for i in range(n-1,-1,-1)]


def similarity(a, b, n):
    ''' Similarity between n-length binary lists obtained from unpacking
    the integers a and b. '''
    a_unpacked = unpackbits(a, n)
    b_unpacked = unpackbits(b, n)
    return np.sum(np.isclose(a_unpacked, b_unpacked))/n


# Plot
n = 3
x = np.arange(2**n+1)
y = np.arange(2**n+1)
xx, yy = np.meshgrid(x, x)
z = np.vectorize(similarity)(yy[:-1,:-1], xx[:-1,:-1], n)

labels = [unpackbits(i, n) for i in x]
# plt.get_cmap is used because plt.cm.get_cmap was removed in newer Matplotlib releases
cmap = plt.get_cmap('binary', n+1)

fig, ax = plt.subplots()
pc = ax.pcolor(x, y, z, cmap=cmap, edgecolor='k', vmin = 0, vmax=1)
ax.set_xticks(x + 0.5)
ax.set_yticks(y + 0.5)
ax.set_xlim(0, 2**n)
ax.set_ylim(0, 2**n)
ax.set_xticklabels(labels, rotation=45)
ax.set_yticklabels(labels)
cbar = fig.colorbar(pc, ax=ax, ticks=[i/n for i in range(n+1)])
cbar.ax.set_ylabel('similarity', fontsize=14)
ax.set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()

[image: heatmap of the similarity between every pair of 3-bit binary lists]
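
The grid that the plot colors can also be inspected as plain numbers. The following is only a minimal sketch in plain Python that reuses the bit-unpacking idea from the code above; the name `similarity_matrix` is made up for illustration and is not part of the original code.

```python
def unpackbits(a, n):
    """Unpack the integer `a` into an n-length binary list (most significant bit first)."""
    return [a >> i & 1 for i in range(n - 1, -1, -1)]


def similarity_matrix(n):
    """Fraction of matching positions for every pair of n-length binary lists.

    Hypothetical helper: row i and column j correspond to the bit patterns
    of the integers i and j.
    """
    patterns = [unpackbits(i, n) for i in range(2 ** n)]
    return [
        [sum(p == q for p, q in zip(a, b)) / n for b in patterns]
        for a in patterns
    ]


# Small case (n = 2) so the whole table fits on screen.
matrix = similarity_matrix(2)
for row in matrix:
    print(row)
```

The diagonal is 1.0 (every list matches itself) and the anti-diagonal is 0.0 (each list against its bitwise complement), which is the pattern the heatmap shows for n = 3.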

Upvotes: 0

blue note

Reputation: 29071

No, the plot does not make sense. What you are computing is essentially an inner product between vectors. Under this metric, l1 and l2 are treated as vectors in a 3D space (in this case), and the result reflects whether they point in the same or a similar direction and have similar length. The output is a single scalar value, so there is nothing to plot.
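
To make the inner-product point concrete, here is a minimal sketch in plain Python. The helper names `dot_similarity` and `matching_similarity` are made up for illustration. The first mirrors the `np.dot` computation from the question and only counts positions where both lists hold a 1; the second counts positions where the values are equal, so shared zeros also contribute.

```python
def dot_similarity(l1, l2):
    """Inner product divided by length, as a percentage.

    Only positions where both lists contain a 1 add to the score.
    """
    return sum(a * b for a, b in zip(l1, l2)) / len(l1) * 100


def matching_similarity(l1, l2):
    """Percentage of positions where the two lists hold the same value."""
    return sum(a == b for a, b in zip(l1, l2)) / len(l1) * 100


# Same lists as in the question: both measures agree here (about 66.67).
print(dot_similarity([1, 0, 1], [1, 1, 1]))
print(matching_similarity([1, 0, 1], [1, 1, 1]))

# Identical all-zero lists: the inner product gives 0.0, matching gives 100.0.
print(dot_similarity([0, 0, 0], [0, 0, 0]))
print(matching_similarity([0, 0, 0], [0, 0, 0]))
```

The fraction of matching positions is commonly called the simple matching coefficient, and it equals one minus the normalized Hamming distance between the two lists.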

If you want to show the individual contribution of each component, you could do something like

contributions = [int(a == b) for a, b in zip(l1, l2)]
plt.plot(range(len(contributions)), contributions)

but I'm still not sure that this makes sense.

Upvotes: 1
