HIG

Reputation: 23

Create a matrix from operation on multiple lists in R

I want to generate a heatmap of Jaccard indices computed between vectors of strings. Say I have 4 vectors: I want to calculate the Jaccard index for every pair of vectors and get the result as a 4 x 4 matrix, so that each cell holds the Jaccard index of one specific pair. As a toy example, my vectors look like this:

sample.set.1 <- c("A1", "B1", "C1", "D1")
sample.set.2 <- c("A2", "B1", "C1", "D2")
sample.set.3 <- c("A3", "B3", "C2", "D1")
sample.set.4 <- c("A4", "B4", "C4", "D4")

I can then calculate the Jaccard index for one pair like so:

jaccard <- function(a, b){
  shared.len <- length(intersect(a, b))
  union <- (length(a)+length(b)) - shared.len
  return(shared.len / union)
}
jaccard(sample.set.1, sample.set.2)

This gives me the Jaccard index for one specific comparison. Can someone advise on a concise way of applying this to all pairs of vectors, leaving me with a 4 x 4 matrix, without repeating loads of code?

I could do this by making every comparison in a loop, but I am interested in doing it with one of R's apply functions, or something similarly concise.

Upvotes: 0

Views: 47

Answers (2)

Onyambu

Reputation: 79318

In base R, using the jaccard function as defined in your post, you could simply do:

samples <- mget(ls(pattern = "sample.set")) # Get all samples into a list

structure(combn(samples, 2, \(x) jaccard(x[[1]], x[[2]])),
          Size = length(samples), Labels = names(samples), class = "dist")

             sample.set.1 sample.set.2 sample.set.3
sample.set.2    0.3333333                          
sample.set.3    0.1428571    0.0000000             
sample.set.4    0.0000000    0.0000000    0.0000000
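The result above is a lower-triangular dist object rather than the 4 x 4 matrix asked for. A minimal sketch of one way to expand it, assuming the toy vectors and jaccard function from the question (both are repeated here only so the snippet runs on its own; pairwise and full are names introduced for illustration):

```r
# Toy vectors and jaccard() copied from the question
samples <- list(
  sample.set.1 = c("A1", "B1", "C1", "D1"),
  sample.set.2 = c("A2", "B1", "C1", "D2"),
  sample.set.3 = c("A3", "B3", "C2", "D1"),
  sample.set.4 = c("A4", "B4", "C4", "D4")
)
jaccard <- function(a, b) {
  shared.len <- length(intersect(a, b))
  shared.len / (length(a) + length(b) - shared.len)
}

# Same construction as above: one value per unordered pair, stored as a dist
pairwise <- structure(combn(samples, 2, function(x) jaccard(x[[1]], x[[2]])),
                      Size = length(samples), Labels = names(samples),
                      class = "dist")

# as.matrix() mirrors the lower triangle and fills the diagonal with 0
full <- as.matrix(pairwise)

# These values are similarities, so each vector compared with itself should be 1
diag(full) <- 1
```

The matrix full can then be passed to a plotting function, for example heatmap(full, Rowv = NA, Colv = NA, scale = "none").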

Upvotes: 1

emilliman5

Reputation: 5966

The dist function from the proxy package allows you to pass a custom function to compute the distance. However, the first thing to do is combine your sample.set vectors into one object. I used mget to pull them into a list and then passed your jaccard function as the method. I'd also note that proxy has the Jaccard similarity metric built in.

proxy::dist(mget(grep("sample.set.\\d", ls(), value = TRUE)), method = jaccard)

#             sample.set.1 sample.set.2 sample.set.3
#sample.set.2    0.3333333                          
#sample.set.3    0.1428571    0.0000000             
#sample.set.4    0.0000000    0.0000000    0.0000000
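Matrix-based similarity measures generally work on a presence/absence encoding rather than on a list of character vectors. As a rough sketch of that encoding, the following computes the same Jaccard index in base R only, with no proxy call; items, incidence, shared, sizes and jaccard.matrix are names introduced here for illustration, and the toy vectors are repeated so the snippet is self-contained:

```r
# Toy vectors copied from the question
samples <- list(
  sample.set.1 = c("A1", "B1", "C1", "D1"),
  sample.set.2 = c("A2", "B1", "C1", "D2"),
  sample.set.3 = c("A3", "B3", "C2", "D1"),
  sample.set.4 = c("A4", "B4", "C4", "D4")
)

# Every distinct string that occurs in any vector
items <- sort(unique(unlist(samples)))

# Rows = items, columns = samples; 1 where the sample contains the item
incidence <- sapply(samples, function(s) as.numeric(items %in% s))
rownames(incidence) <- items

# crossprod() counts the shared items for every pair of columns
shared <- crossprod(incidence)

# |A union B| = |A| + |B| - |A intersect B|
sizes <- colSums(incidence)
jaccard.matrix <- shared / (outer(sizes, sizes, "+") - shared)
```

This returns the full symmetric 4 x 4 matrix with 1 on the diagonal. Because each vector is reduced to presence/absence, repeated strings within a vector count once, which differs from the length-based formula in the question only when a vector contains duplicates.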

Upvotes: 1
