Reputation: 1043
I have a DataFrame of shape (1510, 1399). The columns represent products, and the rows hold the values (0 or 1) assigned by a user for a given product. How can I compute the jaccard_similarity_score for each pair of products?
I created a placeholder dataframe listing product vs. product
data_ibs = pd.DataFrame(index=data_g.columns,columns=data_g.columns)
I am not sure how to iterate through data_ibs to compute the similarities.
for i in range(0,len(data_ibs.columns)) :
    # Loop through the columns for each column
    for j in range(0,len(data_ibs.columns)) :
        .........
Upvotes: 31
Views: 41705
Reputation: 23141
Jaccard similarity scores can also be calculated with scipy.spatial.distance.pdist. One of its metrics is 'jaccard', which computes the Jaccard dissimilarity, so each score has to be subtracted from 1 to get the Jaccard similarity. pdist compares the rows of its input, so the frame is transposed to compare columns, and it returns a condensed 1D array with one value per pair of columns.
One could construct a Series from the scores by constructing a MultiIndex.
import pandas as pd
from scipy.spatial.distance import pdist
pairs = [(c1, c2) for i, c1 in enumerate(df) for c2 in df.columns[i+1:]]
jaccard_similarity = pd.Series(1 - pdist(df.values.T, metric='jaccard'), index=pd.MultiIndex.from_tuples(pairs))
Using ayhan's setup (the 40-row sample DataFrame with columns A to E from the pairwise_distances answer), it produces the following:
A  B    0.300000
   C    0.457143
   D    0.342857
   E    0.466667
B  C    0.294118
   D    0.333333
   E    0.233333
C  D    0.405405
   E    0.441176
D  E    0.363636
dtype: float64
If a matrix is desired, it can be constructed from the pdist output as well. Just create an empty matrix, fill the off-diagonal entries with these values, and set the diagonal to 1.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

def jaccard_similarity_matrix(df):
    n = df.shape[1]
    # condensed vector of similarities, one entry per column pair (i < j)
    scores = 1 - pdist(np.array(df).T, metric='jaccard')
    result = np.zeros((n,n))
    # fill the upper triangle, mirror it, then set self-similarity to 1
    result[np.triu_indices(n, k=1)] = scores
    result += result.T
    np.fill_diagonal(result, 1)
    return pd.DataFrame(result, index=df.columns, columns=df.columns)

jaccard_similarity = jaccard_similarity_matrix(df)
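SciPy also provides scipy.spatial.distance.squareform, which expands the condensed vector returned by pdist into a symmetric square matrix, so the manual fill can be shortened. The following is only a minimal sketch under a few assumptions: the helper name jaccard_similarity_squareform and the toy frame are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def jaccard_similarity_squareform(df):
    """Column-by-column Jaccard similarity of a 0/1 DataFrame (illustrative helper)."""
    # pdist compares rows, so transpose to compare columns; cast to bool
    # because 'jaccard' is a boolean metric.
    distances = pdist(df.to_numpy().T.astype(bool), metric='jaccard')
    # squareform builds a symmetric matrix with a zero diagonal, so
    # 1 - distance yields similarities with 1 on the diagonal.
    similarity = 1 - squareform(distances)
    return pd.DataFrame(similarity, index=df.columns, columns=df.columns)


# Hypothetical toy data: 4 users (rows) x 3 products (columns).
toy = pd.DataFrame({'p1': [1, 1, 0, 0], 'p2': [1, 0, 0, 1], 'p3': [0, 1, 1, 0]})
sim = jaccard_similarity_squareform(toy)
print(sim)
```

In the toy frame, p1 and p2 share one user out of three who chose either product, so their similarity is 1/3, while p2 and p3 share none and get 0.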
In fact, by following the source code of pdist, an entirely custom function that uses only numpy and basic Python can be written as well.
import numpy as np
import pandas as pd

def jaccard_matrix(df):
    def jaccard(x, y):
        # consider only positions where at least one vector is nonzero
        nonzero = (x != 0) | (y != 0)
        a = ((x != y) & nonzero).sum()
        b = nonzero.sum()
        # two all-zero vectors are treated as identical
        return 1 - a / b if b != 0 else 1
    arr = df.values
    n = arr.shape[1]
    scores = [jaccard(arr[:, i], arr[:, j]) for i in range(n-1) for j in range(i+1, n)]
    result = np.zeros((n, n))
    result[np.triu_indices(n, k=1)] = scores
    result += result.T
    np.fill_diagonal(result, 1)
    return pd.DataFrame(result, index=df.columns, columns=df.columns)
All of these functions return the same output which can be verified as follows:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

df = pd.DataFrame(np.random.default_rng().binomial(1, 0.5, size=(100, 10))).add_prefix('col')
x = pd.DataFrame(1 - pairwise_distances(df.values.T.astype(bool), metric='jaccard'), index=df.columns, columns=df.columns)
y = jaccard_similarity_matrix(df)
z = jaccard_matrix(df)
np.allclose(x, y) and np.allclose(y, z) # True
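For strictly 0/1 data, the same matrix can also be obtained without a Python-level loop, because the intersection counts of all column pairs are the entries of the matrix product of the transposed data with itself. This is a sketch of a different technique from the ones above, not what pdist does internally. The name jaccard_matrix_dot and the toy frame are assumptions for illustration, and pairs of all-zero columns are given similarity 1 to match the convention used in jaccard_matrix.

```python
import numpy as np
import pandas as pd


def jaccard_matrix_dot(df):
    """Jaccard similarity between all columns of a 0/1 DataFrame (illustrative)."""
    x = (df.to_numpy() != 0).astype(np.int64)
    # intersection[i, j] = number of rows where columns i and j are both 1
    intersection = x.T @ x
    # number of ones per column; |A or B| = |A| + |B| - |A and B|
    counts = x.sum(axis=0)
    union = counts[:, None] + counts[None, :] - intersection
    # Avoid 0/0 for pairs of all-zero columns and define their similarity as 1.
    safe_union = np.where(union == 0, 1, union)
    result = np.where(union == 0, 1.0, intersection / safe_union)
    return pd.DataFrame(result, index=df.columns, columns=df.columns)


# Hypothetical toy data: 4 users (rows) x 3 products (columns).
toy = pd.DataFrame({'p1': [1, 1, 0, 0], 'p2': [1, 0, 0, 1], 'p3': [0, 1, 1, 0]})
sim_dot = jaccard_matrix_dot(toy)
print(sim_dot)
```

The trade-off is memory: the intermediate matrices are n by n in the number of columns, which is small for a frame with 1399 products but worth keeping in mind for much wider data.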
Upvotes: 0
Reputation:
Use pairwise_distances to calculate the distance and subtract that distance from 1 to find the similarity score:
from sklearn.metrics.pairwise import pairwise_distances
1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
Explanation:
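In newer versions of scikit-learn, the definition of jaccard_score matches the Jaccard similarity coefficient definition in Wikipedia:
J = M11 / (M01 + M10 + M11)
where M11 is the number of rows in which both columns are 1, M01 is the number in which the first column is 0 and the second is 1, and M10 is the number in which the first is 1 and the second is 0. Rows in which both are 0 (M00) are ignored.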
Let's create a sample dataset to see if the results match:
from pandas import DataFrame, crosstab
from numpy.random import default_rng
rng = default_rng(0)
# Create a dataframe of 40 rows and 5 columns (named A, B, C, D, E)
# Each cell in the DataFrame is either 0 or 1 with 50% probability
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))
This yields the following crosstab for columns A and B:
A/B | 0 | 1 |
---|---|---|
0 | 10 | 7 |
1 | 14 | 9 |
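The table can be reproduced with the crosstab function imported above. The sketch below rebuilds the same DataFrame so that it runs on its own; the variable name ct is illustrative.

```python
from numpy.random import default_rng
from pandas import DataFrame, crosstab

rng = default_rng(0)
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))

# Rows are indexed by the values of A, columns by the values of B,
# so ct.loc[a, b] is the number of rows with A == a and B == b.
ct = crosstab(df['A'], df['B'])
print(ct)
```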
Based on the definition, the Jaccard similarity score is:
M00 = (df['A'].eq(0) & df['B'].eq(0)).sum() # 10
M01 = (df['A'].eq(0) & df['B'].eq(1)).sum() # 7
M10 = (df['A'].eq(1) & df['B'].eq(0)).sum() # 14
M11 = (df['A'].eq(1) & df['B'].eq(1)).sum() # 9
print(M11 / (M01 + M10 + M11)) # 0.3
This is what you would get with jaccard_score:
from sklearn.metrics import jaccard_score
print(jaccard_score(df['A'], df['B'])) # 0.3
The problem with the jaccard_score function is that it is not vectorized: it scores one pair of columns per call, so you have to loop over all column pairs to fill the similarity matrix. To avoid that, you can use the vectorized distance version. However, since it returns a "distance" rather than a "similarity", you'll need to subtract that value from 1:
from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T.to_numpy(), metric='jaccard'))
# [[1. 0.3 0.45714286 0.34285714 0.46666667]
# [0.3 1. 0.29411765 0.33333333 0.23333333]
# [0.45714286 0.29411765 1. 0.40540541 0.44117647]
# [0.34285714 0.33333333 0.40540541 1. 0.36363636]
# [0.46666667 0.23333333 0.44117647 0.36363636 1. ]]
Optionally, you can convert it back to a DataFrame:
jac_sim = 1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
# reuse the matrix computed above instead of recomputing it
jac_sim_df = DataFrame(jac_sim, index=df.columns, columns=df.columns)
print(jac_sim_df)
# A B C D E
# A 1.000000 0.300000 0.457143 0.342857 0.466667
# B 0.300000 1.000000 0.294118 0.333333 0.233333
# C 0.457143 0.294118 1.000000 0.405405 0.441176
# D 0.342857 0.333333 0.405405 1.000000 0.363636
# E 0.466667 0.233333 0.441176 0.363636 1.000000
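For comparison, the loop-based route with jaccard_score mentioned earlier could look like the sketch below, which also checks that it agrees with the vectorized result. The names looped and vectorized are illustrative, and the data is cast to bool before calling pairwise_distances on the assumption that scikit-learn would otherwise convert it itself and emit a warning.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))

# Start from an identity matrix: every column has similarity 1 with itself.
looped = pd.DataFrame(np.eye(df.shape[1]), index=df.columns, columns=df.columns)
for c1, c2 in combinations(df.columns, 2):
    # jaccard_score handles one pair of binary vectors per call.
    score = jaccard_score(df[c1], df[c2])
    looped.loc[c1, c2] = looped.loc[c2, c1] = score

# Vectorized version; cast to bool because 'jaccard' is a boolean metric.
vectorized = 1 - pairwise_distances(df.T.to_numpy().astype(bool), metric='jaccard')
print(np.allclose(looped.to_numpy(), vectorized))
```

With 1399 product columns the loop would call jaccard_score close to a million times, which is the practical reason to prefer the vectorized call.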
Note: In the previous version of this answer, the calculations used the hamming metric with pairwise_distances because in earlier versions of scikit-learn, jaccard_score was calculated similarly to the accuracy score (i.e. (M00 + M11) / (M00 + M01 + M10 + M11)). That is no longer the case, so the answer was updated to use the jaccard metric instead of hamming.
Upvotes: 87