Reputation: 95
I have a data frame of size m x n of binary vectors with some unfilled values, like the sample below:
col1 col2 col3 col4 col5
V0 1 0 1
V1 1 1 0
V2 0 1 0 1
V3 0 0
I would like to compute a similarity matrix on this data frame such that I get a similarity score between any 2 vectors.
What is the best way to do this?
Note: I tried replacing the NULL values with 2 and applying cosine similarity from the scipy library to the data frame, but the resulting matrix was not correct.
Upvotes: 0
Views: 553
Reputation: 367
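You might want to use pdist or cdist from scipy.spatial.distance with one of the binary distance functions such as dice, jaccard or hamming (see the list of these functions at the end of this page). Note that these return distances, so a similarity score is 1 minus the distance.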
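As a rough sketch of how that could look with missing values: the built-in metrics do not skip NaN, so one option is to pass pdist a custom callable that compares only the columns where both vectors are filled in. The helper name masked_hamming and the positions of the missing values in the sample frame are my own assumptions (the question does not show which columns are empty), so treat this as an illustration rather than a definitive solution.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical layout: which cells are empty is assumed, not taken from the question.
df = pd.DataFrame(
    [
        [1, 0, np.nan, 1, np.nan],
        [1, np.nan, 1, 0, np.nan],
        [0, 1, 0, np.nan, 1],
        [np.nan, 0, np.nan, np.nan, 0],
    ],
    index=["V0", "V1", "V2", "V3"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)


def masked_hamming(u, v):
    """Hamming distance over the columns where both vectors have a value.

    Hypothetical helper: returns NaN when the two vectors share no filled column.
    """
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return np.nan
    # Fraction of shared columns on which the two vectors disagree.
    return np.mean(u[both] != v[both])


# pdist accepts a callable metric and returns the condensed distance vector;
# squareform expands it into a full symmetric matrix with zeros on the diagonal.
dist = squareform(pdist(df.to_numpy(dtype=float), metric=masked_hamming))

# Turn distances into similarities and restore the row labels.
similarity = pd.DataFrame(1.0 - dist, index=df.index, columns=df.index)
print(similarity)
```

Pairs with no overlapping filled columns come out as NaN here instead of a misleading score. Whether that is the right choice, or whether such pairs should count as 0, depends on what the similarity is used for.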
Upvotes: 0