Reputation: 2645
I have a matrix that contains, in each of its N rows (iterations of a clustering algorithm), the cluster to which each of its M points (columns) belongs:
For instance:
data <- t(rmultinom(50, size = 7, prob = rep(0.1,10)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 2 1 1 0 2 1 0
[2,] 3 1 2 0 0 0 0 1 0 0
[3,] 0 1 2 1 0 0 0 0 2 1
[4,] 0 1 1 0 2 0 0 2 0 1
[5,] 3 0 0 0 2 1 0 0 0 1
[6,] 0 1 2 0 0 1 1 2 0 0
[7,] 0 1 0 1 0 1 1 2 1 0
[8,] 3 0 0 2 0 0 0 1 0 1
...
I want to build a co-occurrence matrix where position (i, j) holds the number of times that points i and j have been assigned to the same cluster across the different rows.
A naive approach would be:
coincidences <- matrix(0, nrow = 10, ncol = 10)
for (n in 1:50) {
  for (m in 1:10) {
    # add 1 to every point that shares point m's cluster in iteration n
    coincidences[m, ] <- coincidences[m, ] + as.numeric(data[n, m] == data[n, ])
  }
}
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 50 17 21 22 15 14 16 20 18 18
[2,] 17 50 17 14 17 18 15 14 20 16
[3,] 21 17 50 20 21 16 16 13 16 20
[4,] 22 14 20 50 16 18 16 21 18 14
[5,] 15 17 21 16 50 18 16 17 11 17
[6,] 14 18 16 18 18 50 18 22 25 13
[7,] 16 15 16 16 16 18 50 14 20 22
[8,] 20 14 13 21 17 22 14 50 11 15
[9,] 18 20 16 18 11 25 20 11 50 18
[10,] 18 16 20 14 17 13 22 15 18 50
How can I make it faster?
Extra: how can I plot the result using ggplot2? (I have seen heatmap.2 in the gplots package, but I don't know whether that would be overkill.)
Upvotes: 2
Views: 304
Reputation: 1384
A C++ implementation in R via the Rcpp package gets the job done, probably about as fast as it will get:
library(Rcpp)
data <- t(rmultinom(50, size = 7, prob = rep(0.1,10)))
coincidences <- matrix(0, nrow=10, ncol=10)
# R implementation (dimensions taken from data rather than hardcoded)
fR <- function(data, coincidences) {
  for (n in 1:nrow(data)) {
    for (m in 1:ncol(data)) {
      coincidences[m, ] <- coincidences[m, ] + as.numeric(data[n, m] == data[n, ])
    }
  }
  coincidences
}
# C++ implementation
cppFunction('NumericMatrix fC(NumericMatrix data, NumericMatrix coincidences) {
  int nrow = data.nrow(), ncol = data.ncol();
  for (int n = 0; n < nrow; n++) {       // clustering iterations (rows)
    for (int m = 0; m < ncol; m++) {     // points (columns)
      for (int p = 0; p < ncol; p++) {   // compare point m with every point p
        if (data(n, m) == data(n, p)) {
          coincidences(m, p) += 1;
        }
      }
    }
  }
  return coincidences;
}')
# Call the functions; fC fills coincidences in place, so reset it before each call
coincidences <- matrix(0, nrow = 10, ncol = 10)
c1 <- fC(data, coincidences)
coincidences <- matrix(0, nrow = 10, ncol = 10)
c2 <- fR(data, coincidences)
all.equal(c1, c2)
[1] TRUE
library(microbenchmark)
microbenchmark(fC(data,coincidences),fR(data,coincidences))
Unit: microseconds
expr min lq mean median uq max neval
fC(data, coincidences) 6.415 6.736 8.88454 7.698 8.660 74.727 100
fR(data, coincidences) 283.514 290.089 301.84637 293.456 309.973 388.388 100
To plot:
library(ggplot2)
library(reshape2)
C <- fC(data, coincidences)
ggplot(melt(C), aes(Var1, Var2, fill = value)) + geom_raster()
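If you want something closer to a conventional heatmap, here is a slightly more polished sketch; the variable names, colour scale and fixed aspect ratio are my own choices, not part of the answer above:
library(ggplot2)
library(reshape2)

# Long format: one row per (point_i, point_j, count) triple
df <- melt(C, varnames = c("point_i", "point_j"), value.name = "count")

ggplot(df, aes(point_i, point_j, fill = count)) +
  geom_tile() +                                            # one tile per pair
  scale_fill_gradient(low = "white", high = "steelblue") +
  coord_fixed() +                                          # square tiles
  labs(x = "point", y = "point", fill = "co-occurrences")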
Upvotes: 4
Reputation: 3728
A faster way, using vectorization and colSums:
> set.seed(1)
> data <- t(rmultinom(10000, size = 7, prob = rep(0.1,100)))
>
> system.time({
+ coincidences <- matrix(0, nrow=100, ncol=100)
+ for (n in 1:10000){
+ for (m in 1:100){
+ coincidences[m,] <- coincidences[m,] + as.numeric(data[n,m] == data[n,])
+ }
+ }}
+ )
user system elapsed
9.692 0.000 9.708
>
> system.time(coincidences2<-sapply(1:ncol(data), function(i){ colSums(data[,i]==data) }))
user system elapsed
0.676 0.096 0.774
>
> all.equal(coincidences2,coincidences)
[1] TRUE
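For completeness, a further vectorized alternative (just a sketch, not benchmarked here): treat each cluster label k as an indicator matrix data == k and sum crossprod terms, which pushes the pairwise counting into BLAS. The names vals and coincidences3 are mine:
vals <- sort(unique(as.vector(data)))

# For each label k, (data == k) is an N x M indicator matrix, and
# crossprod counts, for every pair (i, j), the rows where both columns equal k;
# summing over all k counts the rows where columns i and j agree on any label.
coincidences3 <- Reduce(`+`, lapply(vals, function(k) crossprod(data == k)))

all.equal(coincidences3, coincidences2, check.attributes = FALSE)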
Upvotes: 1