Akshay Bharadwaj

Reputation: 95

What is the best way to compute a similarity matrix for a dataframe of binary vectors?

I have a data frame of size m x n containing binary vectors, with some values left unfilled, as in the sample below:

col1 col2 col3 col4 col5
 V0    1         0    1
 V1    1    1         0
 V2    0    1    0    1
 V3         0    0

I would like to compute a similarity matrix on this data frame such that I get a similarity score between any 2 vectors.

What is the best way to do this?

Note: I tried replacing the NULL values with 2 and applying cosine similarity from the scipy library to the data frame, but the resulting matrix was not correct.

Upvotes: 0

Views: 553

Answers (1)

jpl

Reputation: 367

You might want to use pdist or cdist from scipy.spatial.distance with a binary distance function such as dice, jaccard or hamming (see the list of these functions at the end of this page).
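As a rough illustration of the hamming idea applied to the sample in the question, here is a minimal sketch. It makes two assumptions that are not from the question: the first column of the sample (V0 to V3) is read as the row label, and an unfilled value is simply skipped, so each pair of vectors is compared only on the columns where both are filled. It uses plain NumPy matrix products rather than pdist so that the skipping is explicit; the names `shared`, `matches` and `sim` are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample from the question; NaN marks an unfilled value.
# Assumption: col1 holds the row labels V0..V3, col2..col5 hold the data.
df = pd.DataFrame(
    [[1, np.nan, 0, 1],
     [1, 1, np.nan, 0],
     [0, 1, 0, 1],
     [np.nan, 0, 0, np.nan]],
    index=["V0", "V1", "V2", "V3"],
    columns=["col2", "col3", "col4", "col5"],
)

X = df.to_numpy(dtype=float)
filled = (~np.isnan(X)).astype(float)     # 1 where a value is present
ones = np.nan_to_num(X, nan=0.0)          # 1 where the value is a filled 1
zeros = (1.0 - ones) * filled             # 1 where the value is a filled 0

# Number of columns filled in both vectors of each pair.
shared = filled @ filled.T
# Number of shared columns where the two vectors agree (both 1 or both 0).
matches = ones @ ones.T + zeros @ zeros.T

# Fraction of agreeing columns (1 - normalized hamming distance on shared
# columns); NaN when a pair has no filled column in common.
sim_values = np.divide(
    matches, shared,
    out=np.full_like(matches, np.nan),
    where=shared > 0,
)
sim = pd.DataFrame(sim_values, index=df.index, columns=df.index)
print(sim.round(3))
```

Whether skipping unfilled values is the right choice depends on what a missing entry means in your data. If you prefer jaccard or dice, the same masking idea applies, but only the positions where at least one vector is 1 would be counted.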

Upvotes: 0
