ykoo

Reputation: 249

How to calculate Jaccard similarity on a data frame in R?

I am trying to find similar users using Jaccard similarity.

I want to transform the original df into the result df.

Each value in the result df is the size of the intersection divided by the size of the union.

For example:

The similarity between User 1 and User 2 is 1/2.

1/2 = (number of items both users have in common) / (total number of distinct items the two users have)

In this way, I want to create a result df that holds the similarity for every pair of users.

How can I do this?

[Two images, not reproduced here: the original df and the desired result df]

Upvotes: 0

Views: 1298

Answers (1)

Ronak Shah

Reputation: 388982

Write a function that calculates the number of items both users have in common divided by the total number of items that either user has.

calc <- function(x, y) {
  # items both users have / items at least one of them has
  sum(x == 'Y' & y == 'Y') / sum(x == 'Y' | y == 'Y')
}
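As a quick check, the function can be called on two users directly. This is only a sketch: `user1`, `user2` and `none` are hypothetical vectors (the first two match rows 1 and 2 of the sample data at the end of this answer), and `calc` is repeated so the snippet runs on its own.

```r
# Same definition as calc() in this answer, repeated so the snippet is self-contained.
calc <- function(x, y) {
  sum(x == 'Y' & y == 'Y') / sum(x == 'Y' | y == 'Y')
}

# Hypothetical users, matching the first two rows of the sample data.
user1 <- c('Y', 'Y', 'N', 'N', 'N')
user2 <- c('Y', 'N', 'N', 'N', 'N')

calc(user1, user2)
# [1] 0.5

# Edge case: if neither user has any item, the union is empty and 0/0 gives NaN.
none <- rep('N', 5)
calc(none, none)
# [1] NaN
```

If users with no items can occur in your data, you may want to decide explicitly what their similarity should be (for example 0) rather than leaving NaN in the result.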

Split the data row-wise with asplit and apply the function to every pair of rows with outer:

tmp <- asplit(df, 1)
outer(tmp, tmp, Vectorize(calc))

#      [,1] [,2] [,3]  [,4] [,5]
#[1,] 1.000  0.5  0.0 0.333  0.4
#[2,] 0.500  1.0  0.0 0.000  0.2
#[3,] 0.000  0.0  1.0 0.000  0.4
#[4,] 0.333  0.0  0.0 1.000  0.4
#[5,] 0.400  0.2  0.4 0.400  1.0
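For larger data, calling the function once per pair may become slow. A possible alternative, not part of the approach above, is to compute all intersections at once with matrix algebra. The sketch below assumes the same 'Y'/'N' layout as the sample data; the names `m`, `common`, `counts`, `total` and `sim` are made up for illustration.

```r
# Sample data, repeated so the snippet is self-contained.
df <- data.frame(item1 = c('Y', 'Y', 'N', 'N', 'Y'),
                 item2 = c('Y', 'N', 'N', 'Y', 'Y'),
                 item3 = c('N', 'N', 'Y', 'N', 'Y'),
                 item4 = c('N', 'N', 'Y', 'N', 'Y'),
                 item5 = c('N', 'N', 'N', 'Y', 'Y'))

# 0/1 matrix: one row per user, one column per item.
m <- (as.matrix(df) == 'Y') * 1

# common[i, j] = number of items users i and j both have (m %*% t(m)).
common <- tcrossprod(m)

# Number of items per user.
counts <- rowSums(m)

# Union size: |A| + |B| - |A and B|.
total <- outer(counts, counts, "+") - common

# Jaccard similarity for every pair of users.
sim <- common / total
round(sim, 3)
```

Here `sim[1, 2]` is 0.5 and `sim[1, 4]` is 1/3, in line with the matrix above. As with `calc`, a pair of users with no items at all would give NaN.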

Data

It would be helpful if you provide data in a reproducible format instead of an image.

df <- data.frame(item1 = c('Y', 'Y', 'N', 'N', 'Y'), 
                 item2 = c('Y', 'N', 'N', 'Y', 'Y'), 
                 item3 = c('N', 'N', 'Y', 'N', 'Y'), 
                 item4 = c('N', 'N', 'Y', 'N', 'Y'), 
                 item5 = c('N', 'N', 'N', 'Y', 'Y'))

Upvotes: 2
