kitchenprinzessin

Reputation: 1043

How to compute jaccard similarity from a pandas dataframe

I have a dataframe of shape (1510, 1399). The columns represent products, and each row holds the values (0 or 1) that one user assigned to those products. How can I compute jaccard_similarity_scores?


I created a placeholder dataframe listing product vs. product

data_ibs = pd.DataFrame(index=data_g.columns,columns=data_g.columns)

I am not sure how to iterate through data_ibs to compute similarities.

for i in range(0,len(data_ibs.columns)) :
    # Loop through the columns for each column
    for j in range(0,len(data_ibs.columns)) :
        .........

Upvotes: 31

Views: 41705

Answers (2)

cottontail

Reputation: 23141

Jaccard similarity scores can also be calculated using scipy.spatial.distance.pdist. One of its metrics is 'jaccard', which computes the Jaccard dissimilarity, so the result has to be subtracted from 1 to get the Jaccard similarity. pdist returns a 1D array with one value per pair of columns, ordered as (0, 1), (0, 2), ..., (1, 2), and so on.

One could construct a Series from the scores by constructing a MultiIndex.

import pandas as pd
from scipy.spatial.distance import pdist
pairs = [(c1, c2) for i, c1 in enumerate(df) for c2 in df.columns[i+1:]]  # same order as pdist's output
jaccard_similarity = pd.Series(1 - pdist(df.values.T, metric='jaccard'), index=pd.MultiIndex.from_tuples(pairs))

Using ayhan's setup, it produces the following:

A  B    0.300000
   C    0.457143
   D    0.342857
   E    0.466667
B  C    0.294118
   D    0.333333
   E    0.233333
C  D    0.405405
   E    0.441176
D  E    0.363636
dtype: float64

If a matrix is desired, it can be constructed from the pdist output as well. Create a matrix of zeros, fill the off-diagonal entries with these values, and set the diagonal to 1.

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

def jaccard_similarity_matrix(df):
    
    n = df.shape[1]
    scores = 1 - pdist(np.array(df).T, metric='jaccard')
    result = np.zeros((n,n))
    result[np.triu_indices(n, k=1)] = scores
    result += result.T
    np.fill_diagonal(result, 1)
    return pd.DataFrame(result, index=df.columns, columns=df.columns)

jaccard_similarity = jaccard_similarity_matrix(df)
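
As a shorter alternative to filling the triangle by hand, scipy.spatial.distance.squareform can expand the condensed pdist output into a full symmetric matrix. The following is a minimal sketch, assuming the frame holds only 0/1 values; the name jaccard_similarity_squareform and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def jaccard_similarity_squareform(df):
    # pdist returns the condensed (upper-triangle) Jaccard distances between columns
    dist = pdist(df.to_numpy().T.astype(bool), metric='jaccard')
    # squareform expands them into a symmetric matrix with zeros on the diagonal,
    # so 1 - distance leaves 1 on the diagonal without an extra step
    sim = 1 - squareform(dist)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)

# Hypothetical toy data: 4 users, 3 products
toy = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [1, 0, 1, 0], 'C': [1, 1, 1, 0]})
sim = jaccard_similarity_squareform(toy)
print(sim)
```

Here A and B share one row out of the three rows where either is 1, so their similarity is 1/3.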

In fact, by following the logic in the source code of pdist, an entirely custom function that relies only on numpy and basic Python for the calculation can be written as well.

def jaccard_matrix(df):

    def jaccard(x, y):
        nonzero = (x != 0) | (y != 0)
        a = ((x != y) & nonzero).sum()
        b = nonzero.sum()
        return 1 - a / b if b != 0 else 1
    
    arr = df.values
    n = arr.shape[1]
    scores = [jaccard(arr[:, i], arr[:, j]) for i in range(n-1) for j in range(i+1, n)]
    result = np.zeros((n, n))
    result[np.triu_indices(n, k=1)] = scores
    result += result.T
    np.fill_diagonal(result, 1)
    return pd.DataFrame(result, index=df.columns, columns=df.columns)

All of these functions return the same output, which can be verified as follows:

from sklearn.metrics.pairwise import pairwise_distances
df = pd.DataFrame(np.random.default_rng().binomial(1, 0.5, size=(100, 10))).add_prefix('col')
x = pd.DataFrame(1 - pairwise_distances(df.values.T.astype(bool), metric='jaccard'), index=df.columns, columns=df.columns)
y = jaccard_similarity_matrix(df)
z = jaccard_matrix(df)

np.allclose(x, y) and np.allclose(y, z)    # True
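
For a frame as large as the one in the question, the same matrix can also be obtained from matrix products alone. This is a sketch rather than part of the original answer: it assumes every cell is 0 or 1, the name jaccard_matrix_matmul is hypothetical, and giving a pair of all-zero columns a similarity of 1 is a choice made here to mirror the custom jaccard function above.

```python
import numpy as np
import pandas as pd

def jaccard_matrix_matmul(df):
    x = df.to_numpy().astype(np.int64)        # rows = users, columns = products
    intersection = x.T @ x                    # [i, j] = rows where both columns are 1
    counts = x.sum(axis=0)                    # number of 1s in each column
    union = counts[:, None] + counts[None, :] - intersection
    # Assumption: pairs with an empty union (both columns all zero) get similarity 1
    sim = np.ones(union.shape, dtype=float)
    np.divide(intersection, union, out=sim, where=union != 0)
    return pd.DataFrame(sim, index=df.columns, columns=df.columns)

# Hypothetical toy data, including an all-zero column Z
toy = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [1, 0, 1, 0],
                    'C': [1, 1, 1, 0], 'Z': [0, 0, 0, 0]})
w = jaccard_matrix_matmul(toy)
print(w)
```

The intersection counts come from a single product x.T @ x, and the union follows from inclusion-exclusion, so no Python-level loop over column pairs is needed.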

Upvotes: 0

user2285236

Reputation:

Use pairwise_distances to calculate the distance and subtract that distance from 1 to find the similarity score:

from sklearn.metrics.pairwise import pairwise_distances
1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')

Explanation:

In newer versions of scikit-learn, jaccard_score follows the definition of the Jaccard similarity coefficient given on Wikipedia:

J = M11 / (M01 + M10 + M11)

where

  • M11 represents the total number of attributes where A and B both have a value of 1.
  • M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
  • M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
  • M00 represents the total number of attributes where A and B both have a value of 0.

Let's create a sample dataset to see if the results match:

from pandas import DataFrame, crosstab
from numpy.random import default_rng
rng = default_rng(0)

# Create a dataframe of 40 rows and 5 columns (named A, B, C, D, E)
# Each cell in the DataFrame is either 0 or 1 with 50% probability
df = DataFrame(rng.binomial(1, 0.5, size=(40, 5)), columns=list('ABCDE'))

This yields the following crosstab for columns A and B:

A/B    0    1
0     10    7
1     14    9

Based on the definition, the Jaccard similarity score is:

M00 = (df['A'].eq(0) & df['B'].eq(0)).sum()  # 10
M01 = (df['A'].eq(0) & df['B'].eq(1)).sum()  # 7
M10 = (df['A'].eq(1) & df['B'].eq(0)).sum()  # 14
M11 = (df['A'].eq(1) & df['B'].eq(1)).sum()  # 9


print(M11 / (M01 + M10 + M11))  # 0.3

This is what you would get with jaccard_score:

from sklearn.metrics import jaccard_score
print(jaccard_score(df['A'], df['B']))  # 0.3

The problem with the jaccard_score function is that it is not vectorized: each call compares a single pair of columns, so you have to loop over every pair to build the full matrix. To avoid that, you can use the vectorized distance version. Since it returns a distance rather than a similarity, you need to subtract the result from 1:

from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T.to_numpy(), metric='jaccard'))

# [[1.         0.3        0.45714286 0.34285714 0.46666667]
#  [0.3        1.         0.29411765 0.33333333 0.23333333]
#  [0.45714286 0.29411765 1.         0.40540541 0.44117647]
#  [0.34285714 0.33333333 0.40540541 1.         0.36363636]
#  [0.46666667 0.23333333 0.44117647 0.36363636 1.        ]]

Optionally, you can convert it back to a DataFrame:

jac_sim = 1 - pairwise_distances(df.T.to_numpy(), metric='jaccard')
jac_sim_df = DataFrame(jac_sim, index=df.columns, columns=df.columns)

#           A         B         C         D         E
#  A  1.000000  0.300000  0.457143  0.342857  0.466667
#  B  0.300000  1.000000  0.294118  0.333333  0.233333
#  C  0.457143  0.294118  1.000000  0.405405  0.441176
#  D  0.342857  0.333333  0.405405  1.000000  0.363636
#  E  0.466667  0.233333  0.441176  0.363636  1.000000
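
For comparison, the explicit loop over column pairs that the vectorized call avoids might look like the sketch below. The helper name jaccard_by_loop and the toy frame are hypothetical; only jaccard_score comes from scikit-learn.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import jaccard_score

def jaccard_by_loop(df):
    # Start with all ones: the diagonal stays 1, off-diagonal cells are overwritten
    sim = pd.DataFrame(1.0, index=df.columns, columns=df.columns)
    for a, b in combinations(df.columns, 2):
        score = jaccard_score(df[a], df[b])   # one pair of columns per call
        sim.loc[a, b] = score
        sim.loc[b, a] = score                 # the matrix is symmetric
    return sim

# Hypothetical toy data: 4 users, 3 products
toy = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [1, 0, 1, 0], 'C': [1, 1, 1, 0]})
loop_sim = jaccard_by_loop(toy)
print(loop_sim)
```

With n columns this makes n * (n - 1) / 2 separate calls, which is why the pairwise_distances version is usually preferable for wide frames.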

Note: A previous version of this answer used the hamming metric with pairwise_distances because, in earlier versions of scikit-learn, jaccard_score was calculated like the accuracy score (i.e. (M00 + M11) / (M00 + M01 + M10 + M11)). That is no longer the case, so the answer now uses the jaccard metric instead of hamming.
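
To make the difference between the two definitions concrete, here is a small sketch that plugs the counts from the A/B crosstab into both formulas. The variable names accuracy_like and jaccard are chosen for illustration only.

```python
# Counts taken from the A/B crosstab in this answer
M00, M01, M10, M11 = 10, 7, 14, 9

# Accuracy-style score: matching 0/0 rows count as agreement
accuracy_like = (M00 + M11) / (M00 + M01 + M10 + M11)

# Jaccard similarity: rows where both values are 0 are ignored
jaccard = M11 / (M01 + M10 + M11)

print(accuracy_like)  # 0.475
print(jaccard)        # 0.3
```

The gap between the two numbers comes entirely from the 10 rows where both A and B are 0.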

Upvotes: 87
