Reputation: 23
I want to calculate dissimilarity indices on a binary matrix and have found several functions in R, but I can't get them to agree. I use the Jaccard coefficient as an example with four functions: vegdist(), sim(), designdist(), and dist(). I'm going to use the result for a cluster analysis.
library(vegan)
library(simba)
# Create a random binary matrix with m rows and n columns
function1 <- function(m, n) {
  matrix(sample(0:1, m * n, replace = TRUE), m, n)
}
test <- function1(30, 20)
# Calculate dissimilarity indices with the Jaccard coefficient
dist1 <- vegdist(test, method = "jaccard")
dist2 <- sim(test, method = "jaccard")
dist3 <- designdist(test, method = "a/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")
Does anyone know why dist1 and dist4 are different from dist2 and dist3?
Upvotes: 2
Views: 2024
Reputation: 3682
I am posting this as an answer as well. Here are the main comments on the dissimilarities you calculated:
dist1: you must set binary=TRUE in vegan::vegdist() (this is documented).
dist2: simba::sim() calculates Jaccard similarity, so you must use 1-dist2. The ?sim documentation gives a wrong formula for Jaccard similarity, but the code uses the correct formula. However, the documented formula defines a similarity.
dist3: your vegan::designdist() formula gives Jaccard similarity and you should change it to a dissimilarity. There are many ways of doing this, and the code below gives one.
dist4: this is correctly done.
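To make the similarity/dissimilarity distinction concrete, here is a minimal base-R sketch that works out the counts by hand for two rows. The vectors x and y and the names shared, only_x and only_y are made up for illustration; they correspond to the a, b and c terms of the designdist() formulas.

```r
# Two hypothetical binary rows (illustration only, not from the question's data)
x <- c(1, 1, 1, 0, 0, 1)
y <- c(1, 1, 0, 1, 0, 1)

# a, b, c in designdist() notation
shared <- sum(x == 1 & y == 1)  # a: present in both rows
only_x <- sum(x == 1 & y == 0)  # b: present only in x
only_y <- sum(x == 0 & y == 1)  # c: present only in y

# Jaccard similarity and the matching dissimilarity
jaccard_sim <- shared / (shared + only_x + only_y)
jaccard_dis <- (only_x + only_y) / (shared + only_x + only_y)

# The dissimilarity is one minus the similarity
print(all.equal(jaccard_dis, 1 - jaccard_sim))

# Base R's dist(method = "binary") returns the dissimilarity form
d_base <- as.numeric(dist(rbind(x, y), method = "binary"))
print(all.equal(d_base, jaccard_dis))
```

A formula such as a/(a+b+c) therefore gives large values for similar rows, which is the opposite of what a clustering function expects from a distance.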
Replacing your four last lines with these will do the trick and give numerically identical results with all functions:
#Calculate dissimilarity indices with jaccard coefficient
dist1 <- vegdist(test, method = "jaccard", binary = TRUE)
dist2 <- 1 - sim(test, method = "jaccard")
dist3 <- designdist(test, method = "(b+c)/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")
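One way to confirm that the corrected calls agree is to compare the values with all.equal(). The sketch below is only a suggestion: it fixes a seed (an assumption, the question did not set one), uses base R's dist() as the reference, and runs the vegan comparisons only if vegan is installed. The same pattern could be applied to 1 - sim(...) from simba.

```r
# Reproducible random binary matrix (the seed is an assumption for illustration)
set.seed(1)
test <- matrix(sample(0:1, 30 * 20, replace = TRUE), 30, 20)

# Reference: Jaccard dissimilarity from base R
d_base <- dist(test, method = "binary")

# Compare against the vegan functions if the package is available
if (requireNamespace("vegan", quietly = TRUE)) {
  d_veg <- vegan::vegdist(test, method = "jaccard", binary = TRUE)
  d_des <- vegan::designdist(test, method = "(b+c)/(a+b+c)", abcd = TRUE)
  # as.vector() drops attributes so that only the numeric values are compared
  print(all.equal(as.vector(d_base), as.vector(d_veg)))
  print(all.equal(as.vector(d_base), as.vector(d_des)))
}
```

Each all.equal() call should return TRUE if the two results match up to numerical tolerance.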
Upvotes: 2