anat

Reputation: 143

Calculate Jaccard distance between rows in R

I need to calculate the Jaccard distance between each pair of rows in a data frame. The result needs to be a matrix or data frame that holds the pairwise distances.

Like this:

   1     2   3 ..
1  0    0.2  1 
2  0.2  0    0.4
3  1    0.4  0
.
.

My data:

dput(items[1:10,])

structure(list(Drama = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), Comedy = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Crime = c(0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), SciFi = c(1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L), Kids = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L), Classic = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
0L), Foreign = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Thriller = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Action = c(0L, 0L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), Adventure = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), Animation = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), Adult = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), History = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Documentry = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), Nature = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), Horror = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
0L), Show = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Series = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), BlackWhite = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Drama", "Comedy", "Crime", 
"SciFi", "Kids", "Classic", "Foreign", "Thriller", "Action", 
"Adventure", "Animation", "Adult", "History", "Documentry", "Nature", 
"Horror", "Show", "Series", "BlackWhite"), row.names = c(NA, 
10L), class = "data.frame")

My code:

Jaccard_dist <- dist(items, items, method = "Jaccard")

write.csv(Jaccard_dist,'Jaccard_dist.csv')

Do you know of a way to do this without using two for-loops?

Upvotes: 5

Views: 11173

Answers (2)

teadoub

Reputation: 63

The "binary" method of R's native dist() function does in fact compute the Jaccard distance, although the documentation does not call it that. The description fits ("The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.") and so does the output, which is exactly the same as in the accepted answer:

> dist(data, method = "binary")
           1         2         3         4         5         6         7         8         9
2  1.0000000                                                                                
3  1.0000000 0.6666667                                                                      
4  0.8000000 0.8000000 1.0000000                                                            
5  1.0000000 0.8000000 0.6666667 0.8000000                                                  
6  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667                                        
7  1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000                              
8  0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000                    
9  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667          
10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
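
To make the quoted description concrete, here is the same definition written as a formula, followed by a hand check of one entry. The symbols A and B are my own notation (the sets of columns that are "on" in two rows), not something taken from the dist() documentation:

```latex
d_J(A, B) = \frac{|A \,\triangle\, B|}{|A \cup B|}
          = 1 - \frac{|A \cap B|}{|A \cup B|}

% Rows 1 and 8 of the question's data:
% A = \{\text{SciFi}, \text{Kids}, \text{Classic}\}, \quad
% B = \{\text{Kids}, \text{Classic}, \text{Action}\}
d_J(A, B) = 1 - \frac{2}{4} = 0.5
```

This agrees with the 0.5000000 in row 8, column 1 of the output above.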

Upvotes: 5

Aramis7d

Reputation: 2496

Not sure why you need two for loops.

You can try the proxy package and use:

proxy::dist(dft, by_rows = TRUE, method = "Jaccard")

This returns:

#           1         2         3         4         5         6         7         8         9
#2  1.0000000                                                                                
#3  1.0000000 0.6666667                                                                      
#4  0.8000000 0.8000000 1.0000000                                                            
#5  1.0000000 0.8000000 0.6666667 0.8000000                                                  
#6  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667                                        
#7  1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000                              
#8  0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000                    
#9  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667          
#10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
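
The result above is a dist object, which only stores the lower triangle, while the question asks for a full matrix that can be written to CSV. Below is a minimal sketch of that last step. It uses base R's dist(..., method = "binary") so that it runs without proxy installed, on the assumption that the result of proxy::dist() can be passed to as.matrix() in the same way. The small items data frame (rows 1, 4 and 8 of the question's data, with only the relevant columns) and the out_file path are just for illustration:

```r
# Rows 1, 4 and 8 of the question's data, restricted to the columns where
# at least one of them is non-zero (columns that are zero in every row
# do not change the Jaccard distance).
items <- data.frame(
  Comedy  = c(0L, 1L, 0L),
  SciFi   = c(1L, 0L, 0L),
  Kids    = c(1L, 0L, 1L),
  Classic = c(1L, 1L, 1L),
  Action  = c(0L, 1L, 1L),
  row.names = c("1", "4", "8")
)

# Pairwise distances between rows; "binary" is the Jaccard distance.
jaccard <- dist(items, method = "binary")

# Expand the lower triangle into a full symmetric matrix with a zero diagonal.
jaccard_mat <- as.matrix(jaccard)
print(jaccard_mat)

# Write the matrix to CSV (out_file is an arbitrary illustrative path).
out_file <- file.path(tempdir(), "Jaccard_dist.csv")
write.csv(jaccard_mat, out_file)
```

The row and column names of jaccard_mat come from the row names of the data frame, so the CSV has the same layout as the table sketched in the question.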

Upvotes: 5
