Reputation: 53806
I have two binary lists that I'm attempting to compare. To compare them, I sum where corresponding values are equal and transform this into a percentage:
import numpy as np
l1 = [1,0,1]
l2 = [1,1,1]
print(np.dot(l1 , l2) / len(l1) * 100)
prints 66.666
So in this case l1 and l2 are 66.666 in terms of closeness. As the lists become less similar, the closeness value decreases.
For example using values :
l1 = [1,0,1]
l2 = [0,1,0]
returns 0.0
How can I plot l1 and l2 in a way that describes the relationship between them? Is there a name for this method of measuring similarity between binary values?
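For reference, on 0/1 lists the dot product simply counts the positions where both entries are 1, so dividing by the length gives the fraction of shared 1s. A minimal sketch (variable names are illustrative):

```python
import numpy as np

l1 = [1, 0, 1]
l2 = [1, 1, 1]

# On 0/1 lists, the dot product counts positions where both entries are 1.
both_ones = sum(a * b for a, b in zip(l1, l2))
print(both_ones, np.dot(l1, l2))  # both count the shared 1s
```

This is why a pair of 0s contributes nothing to the score in the original formulation.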
Using a scatter :
import matplotlib.pyplot as plt
import pandas as pd
plt.scatter('x', 'y', data=pd.DataFrame({'x': l1, 'y': l2}))
produces :
But this does not make sense ?
Update :
"if both entries are 0, this will not contribute to your "similarity"
Using updated code below in order to compute similarity, this updated similarity measure includes corresponding 0 values in computing final score.
import numpy as np
l1 = [0,0,0]
l2 = [0,1,0]
print(len([a for a in np.isclose(l1 , l2) if(a)]) / len(l1) * 100)
which returns :
66.66666666666666
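As a side note, the same matching fraction can be computed more directly with an elementwise comparison; for 0/1 lists this is equivalent to the isclose version above (a sketch):

```python
import numpy as np

l1 = [0, 0, 0]
l2 = [0, 1, 0]

# Fraction of positions where the two lists agree (0-0 and 1-1 both count).
score = np.mean(np.array(l1) == np.array(l2)) * 100
print(score)  # 66.66666666666666
```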
Alternatively, using the code below with the measure normalized_mutual_info_score returns 1.0 both for lists that are identical and for lists that are completely different, so normalized_mutual_info_score is not a suitable similarity measure?
from sklearn.metrics.cluster import normalized_mutual_info_score
l1 = [1,0,1]
l2 = [0,1,0]
print(normalized_mutual_info_score(l1 , l2))
l1 = [0,0,0]
l2 = [0,0,0]
print(normalized_mutual_info_score(l1 , l2))
prints :
1.0
1.0
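That behaviour is expected: normalized_mutual_info_score treats the lists as cluster labelings and is invariant to relabeling, so a list and its exact complement score 1.0 just like two identical lists. A pair that only partially agrees scores strictly between 0 and 1. A sketch (the second pair of lists is an illustrative example, not from the question):

```python
from sklearn.metrics.cluster import normalized_mutual_info_score

# Complementary labelings score 1.0, because swapping the labels
# 0 <-> 1 does not change the clustering structure.
print(normalized_mutual_info_score([1, 0, 1], [0, 1, 0]))  # 1.0

# A pair that agrees in some positions but not others scores
# strictly between 0 and 1.
score = normalized_mutual_info_score([1, 0, 1, 0], [1, 0, 1, 1])
print(score)
```

So NMI measures something different from elementwise agreement, which is why it disagrees with the percentage-match approach.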
Upvotes: 2
Views: 1263
Reputation: 9806
import numpy as np
import matplotlib.pyplot as plt
def unpackbits(a, n):
    '''Unpack the integer `a` into an n-length binary list.'''
    return [a >> i & 1 for i in range(n - 1, -1, -1)]

def similarity(a, b, n):
    '''Similarity between the n-length binary lists obtained from
    unpacking the integers a and b.'''
    a_unpacked = unpackbits(a, n)
    b_unpacked = unpackbits(b, n)
    return np.sum(np.isclose(a_unpacked, b_unpacked)) / n
# Plot
n = 3
x = np.arange(2**n+1)
y = np.arange(2**n+1)
xx, yy = np.meshgrid(x, x)
z = np.vectorize(similarity)(yy[:-1,:-1], xx[:-1,:-1], n)
labels = [unpackbits(i, n) for i in x]
cmap = plt.cm.get_cmap('binary', n+1)
fig, ax = plt.subplots()
pc = ax.pcolor(x, y, z, cmap=cmap, edgecolor='k', vmin = 0, vmax=1)
ax.set_xticks(x + 0.5)
ax.set_yticks(y + 0.5)
ax.set_xlim(0, 2**n)
ax.set_ylim(0, 2**n)
ax.set_xticklabels(labels, rotation=45)
ax.set_yticklabels(labels)
cbar = fig.colorbar(pc, ax=ax, ticks=[i/n for i in range(n+1)])
cbar.ax.set_ylabel('similarity', fontsize=14)
ax.set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
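As a quick sanity check of the two helper functions (redefined here so the snippet runs on its own; the integer inputs are worked out by hand from their 3-bit encodings):

```python
import numpy as np

def unpackbits(a, n):
    '''Unpack the integer `a` into an n-length binary list.'''
    return [a >> i & 1 for i in range(n - 1, -1, -1)]

def similarity(a, b, n):
    '''Fraction of matching bits between the n-bit encodings of a and b.'''
    return np.sum(np.isclose(unpackbits(a, n), unpackbits(b, n))) / n

# 5 -> [1, 0, 1] and 2 -> [0, 1, 0]: no positions match.
print(similarity(5, 2, 3))  # 0.0
# 5 -> [1, 0, 1] and 7 -> [1, 1, 1]: two of three positions match.
print(similarity(5, 7, 3))
```

These reproduce the 0.0 and 2/3 scores from the question's examples, so the heatmap cells can be cross-checked against them.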
Upvotes: 0
Reputation: 29071
No, the plot does not make sense. What you are doing is essentially an inner product between vectors. Under this view, l1 and l2 are vectors in a 3D (in this case) space, and the metric measures whether they point in a similar direction and have similar lengths. The output is a scalar value, so there's nothing to plot.
If you want to show the individual contribution of each component, you could do something like
contributions = [a == b for a, b in zip(l1, l2)]
plt.plot(range(len(contributions)), contributions)
but I'm still not sure that this makes sense.
Upvotes: 1