Magnus Hallas

Reputation: 23

The difference between dist functions in R

I want to calculate dissimilarity indices on a binary matrix. I have found several functions in R that do this, but I can't get them to agree. As an example, I use the Jaccard coefficient in four functions: vegdist(), sim(), designdist(), and dist(). I'm going to use the result for a cluster analysis.

library(vegan)
library(simba)

# Create a random binary matrix with m rows and n columns
function1 <- function(m, n) {
  matrix(sample(0:1, m * n, replace = TRUE), m, n)
}
test <- function1(30, 20)

# Calculate dissimilarity indices with the Jaccard coefficient
dist1 <- vegdist(test, method = "jaccard")
dist2 <- sim(test, method = "jaccard")
dist3 <- designdist(test, method = "a/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")

Does anyone know why dist1 and dist4 are different from dist2 and dist3?

Upvotes: 2

Views: 2024

Answers (1)

Jari Oksanen

Reputation: 3682

I am posting this as an answer as well. Here are the main comments on the dissimilarities you calculated:

  • dist1: you must set binary=TRUE in vegan::vegdist() (this is documented).

  • dist2: simba::sim() calculates Jaccard similarity, so you must use 1 - dist2 to get a dissimilarity. The ?sim documentation gives a wrong formula for Jaccard similarity, although the code uses the correct one. Either way, the documented formula defines a similarity, not a dissimilarity.

  • dist3: your vegan::designdist() formula gives Jaccard similarity, and you should change it to a dissimilarity. There are many ways of doing this, and the code below gives one.

  • dist4: this is correctly done.
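
To see why dist2 and dist3 came out different from dist4, it can help to work through a single pair of rows by hand. The following is a minimal base R sketch, assuming the usual 2x2 notation in which a counts the columns where both rows are 1, and b and c count the columns where only one of them is 1. The vectors x and y are made up for illustration, and c is written as cc only to avoid masking R's c() function.

```r
# Two hypothetical binary rows (illustrative data, not from the question)
x <- c(1, 1, 0, 1, 0, 0)
y <- c(1, 0, 0, 0, 1, 0)

a  <- sum(x == 1 & y == 1)  # columns where both rows are 1
b  <- sum(x == 1 & y == 0)  # columns where only x is 1
cc <- sum(x == 0 & y == 1)  # columns where only y is 1

jaccard_sim  <- a / (a + b + cc)         # similarity: the a/(a+b+c) form
jaccard_diss <- (b + cc) / (a + b + cc)  # dissimilarity: the (b+c)/(a+b+c) form

# The dissimilarity is the complement of the similarity
stopifnot(isTRUE(all.equal(jaccard_diss, 1 - jaccard_sim)))

# dist(method = "binary") returns the dissimilarity form
d <- as.numeric(dist(rbind(x, y), method = "binary"))
stopifnot(isTRUE(all.equal(d, jaccard_diss)))
```

Here the two rows share one presence out of four occupied columns, so the similarity is 0.25 and the dissimilarity is 0.75. A function that returns the first value cannot agree with one that returns the second.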

Replacing your last four lines with these gives numerically identical results from all four functions:

# Calculate dissimilarity indices with the Jaccard coefficient
dist1 <- vegdist(test, method = "jaccard", binary = TRUE)
dist2 <- 1 - sim(test, method = "jaccard")
dist3 <- designdist(test, method = "(b+c)/(a+b+c)", abcd = TRUE)
dist4 <- dist(test, method = "binary")
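
If the results still seem to disagree, an independent reference computed in base R can show which function is off. The sketch below is only one possible check and is not part of vegan or simba. The name jaccard_ref and the seed are assumptions made for illustration, and the matrix is regenerated so that the block runs on its own.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
test <- matrix(sample(0:1, 30 * 20, replace = TRUE), 30, 20)

# a[i, j]: number of columns where rows i and j are both 1
a <- tcrossprod(test)

# Number of 1s in each row
n <- rowSums(test)

# a + b + c for each pair: columns where at least one of the two rows is 1
union_ij <- outer(n, n, "+") - a

# Jaccard dissimilarity, (b + c) / (a + b + c), as a dist object
jaccard_ref <- as.dist(1 - a / union_ij)

# Compare values only; dist objects from different functions carry
# different attributes, so comparing them directly reports spurious differences
dist4 <- dist(test, method = "binary")
stopifnot(isTRUE(all.equal(as.vector(jaccard_ref), as.vector(dist4))))
```

The same comparison can be applied to the other results, for example all.equal(as.vector(dist1), as.vector(jaccard_ref)), provided they were computed from the same matrix.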

Upvotes: 2
