Reputation: 23
I want to generate a heatmap of Jaccard indices computed on vectors of strings. Say I have 4 vectors: I want to calculate the Jaccard index for every pair of vectors and get the result as a 4 x 4 matrix, so that each cell holds the Jaccard index of one specific pair. As a toy example, my vectors look like this:
sample.set.1 <- c("A1", "B1", "C1", "D1")
sample.set.2 <- c("A2", "B1", "C1", "D2")
sample.set.3 <- c("A3", "B3", "C2", "D1")
sample.set.4 <- c("A4", "B4", "C4", "D4")
I can then calculate the Jaccard index like so:
jaccard <- function(a, b) {
  shared.len <- length(intersect(a, b))
  union <- (length(a) + length(b)) - shared.len
  return(shared.len / union)
}
jaccard(sample.set.1, sample.set.2)
This gives me the Jaccard index for one specific comparison. My question is: can someone advise on a concise way of applying this to all vector combinations, leaving me with a 4 x 4 matrix, without repeating loads of code?
I could do this by making every comparison in a loop, but I am interested in doing it with one of R's apply functions, or something similarly concise.
Upvotes: 0
Views: 47
Reputation: 79318
In base R, using the function jaccard as defined in your post, you could simply do:
samples <- mget(ls(pattern = "sample.set")) # Get all samples into a list
structure(combn(samples, 2, \(x) jaccard(x[[1]], x[[2]])),
          Size = length(samples), Labels = names(samples), class = 'dist')
             sample.set.1 sample.set.2 sample.set.3
sample.set.2    0.3333333
sample.set.3    0.1428571    0.0000000
sample.set.4    0.0000000    0.0000000    0.0000000
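Since the question asks for a full 4 x 4 matrix, here is a minimal sketch of one way to expand the 'dist' object into one. It repeats the vectors and jaccard from the question so that it runs on its own; the names d and jaccard.mat are made up for illustration. One thing to watch: as.matrix() fills the diagonal of a 'dist' object with 0, which suits distances, but a set compared with itself has a Jaccard index of 1, so the diagonal is overwritten here.

```r
# Toy vectors and jaccard() as given in the question
sample.set.1 <- c("A1", "B1", "C1", "D1")
sample.set.2 <- c("A2", "B1", "C1", "D2")
sample.set.3 <- c("A3", "B3", "C2", "D1")
sample.set.4 <- c("A4", "B4", "C4", "D4")

jaccard <- function(a, b) {
  shared.len <- length(intersect(a, b))
  shared.len / (length(a) + length(b) - shared.len)
}

# Named list of the four vectors (built explicitly rather than via ls())
samples <- list(sample.set.1 = sample.set.1, sample.set.2 = sample.set.2,
                sample.set.3 = sample.set.3, sample.set.4 = sample.set.4)

# Pairwise indices stored as a 'dist' object, as in the answer above
d <- structure(combn(samples, 2, function(x) jaccard(x[[1]], x[[2]])),
               Size = length(samples), Labels = names(samples), class = "dist")

# Expand to a symmetric square matrix; the diagonal comes back as 0
jaccard.mat <- as.matrix(d)

# Self-similarity is 1 for the Jaccard index, so set the diagonal by hand
diag(jaccard.mat) <- 1

jaccard.mat
```

The resulting jaccard.mat is an ordinary numeric matrix with row and column names, so it can be handed to a heatmap function directly.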
Upvotes: 1
Reputation: 5966
The dist function from the proxy package allows you to pass a custom function to compute distance. However, the first thing to do is combine your sample.set vectors into one object. I used mget to pull them into a list and then passed your jaccard function as the method. I'd also note that proxy has the Jaccard similarity metric built in.
proxy::dist(mget(grep("sample.set.\\d", ls(), value = TRUE)), method = jaccard)
#             sample.set.1 sample.set.2 sample.set.3
#sample.set.2    0.3333333
#sample.set.3    0.1428571    0.0000000
#sample.set.4    0.0000000    0.0000000    0.0000000
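If you would rather avoid a package and get the square matrix directly, a nested sapply over the list is one apply-style option. This is only a sketch: it repeats the vectors and jaccard from the question so that it is self-contained, jaccard.mat is a made-up name, and it assumes no vector contains duplicated strings (jaccard uses length(a), not the number of unique elements). The heatmap call uses base R's stats::heatmap with clustering and scaling switched off; any other heatmap function would work on the same matrix.

```r
# Toy vectors and jaccard() as given in the question
sample.set.1 <- c("A1", "B1", "C1", "D1")
sample.set.2 <- c("A2", "B1", "C1", "D2")
sample.set.3 <- c("A3", "B3", "C2", "D1")
sample.set.4 <- c("A4", "B4", "C4", "D4")

jaccard <- function(a, b) {
  shared.len <- length(intersect(a, b))
  shared.len / (length(a) + length(b) - shared.len)
}

# Named list of the four vectors (built explicitly rather than via ls())
samples <- list(sample.set.1 = sample.set.1, sample.set.2 = sample.set.2,
                sample.set.3 = sample.set.3, sample.set.4 = sample.set.4)

# Inner sapply: one vector against all vectors (one column of the result)
# Outer sapply: repeat for every vector, binding the columns into a matrix
jaccard.mat <- sapply(samples, function(a) {
  sapply(samples, function(b) jaccard(a, b))
})

jaccard.mat

# Plot without reordering or rescaling, so cells map straight to the indices
heatmap(jaccard.mat, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```

This computes every pair twice, plus each vector against itself, which is harmless for a handful of vectors but wasteful for many; the combn-based approach computes each pair once.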
Upvotes: 1