Walker in the City

Reputation: 587

Efficiently combine (AND) groups of columns in a logical matrix

I am looking for an efficient way to combine selected columns of a logical matrix by ANDing them together, producing a new matrix. An example of what I am looking for:

matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE  TRUE FALSE  TRUE
[2,]  TRUE FALSE  TRUE  TRUE
[3,] FALSE  TRUE  TRUE FALSE
[4,]  TRUE  TRUE FALSE  TRUE
[5,]  TRUE FALSE  TRUE  TRUE
[6,] FALSE  TRUE  TRUE FALSE

The columns to be ANDed together are specified in a numeric vector of length ncol(exampleMatrix): columns that should be grouped and ANDed share the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have its columns in order 1:n. For example, if the vector that specifies the column groups is

colGroups <- c(3, 2, 2, 1)

Then the resulting matrix would be

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE
[4,]  TRUE FALSE  TRUE
[5,]  TRUE FALSE  TRUE
[6,] FALSE  TRUE FALSE

Where in the resulting matrix

[,1] = exampleMatrix[,4] 
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
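
For illustration, the whole transformation can be expressed as a short base-R helper (andGroups is a hypothetical name, not something from the question):

andGroups <- function(m, groups) {
  # for each group label (in sorted order), AND that group's columns together
  sapply(sort(unique(groups)), function(g)
    Reduce(`&`, lapply(which(groups == g), function(j) m[, j])))
}
andGroups(exampleMatrix, colGroups)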

My current way of doing this looks basically like this:

finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=max(colGroups))
for (i in 1:max(colGroups)){
    selectedColumns <- exampleMatrix[,colGroups==i, drop=FALSE]
    # a group ANDs to TRUE exactly where the row sum equals the number of columns
    finalMatrix[,i] <- rowSums(selectedColumns)==ncol(selectedColumns)
}

Here rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
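
For instance, on the toy matrix above, the trick agrees with an explicit AND of group 2 (a quick sanity check, not from the original post):

g2 <- exampleMatrix[, colGroups == 2, drop=FALSE]
identical(rowSums(g2) == ncol(g2), g2[, 1] & g2[, 2])
# [1] TRUE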

My problem is that I am doing this on very big matrices (millions of rows), and I am looking for any way to make it quicker. My first instinct was to use apply in some way, but I can't see how that would improve efficiency: the loop body only runs a handful of times, and it is the operation inside the loop that is slow.

In addition, any tips to reduce memory allocation would be very useful, as I currently have to run gc() frequently within the loop to avoid running out of memory completely, and gc() is itself an expensive operation that slows everything down. Thanks!

For a more representative example, this is a much larger exampleMatrix:

matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)
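
With a matrix of this size, any candidate approach can be timed with a simple system.time() harness (my addition; exact timings will vary by machine):

system.time({
  finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=max(colGroups))
  for (i in 1:max(colGroups)) {
    selectedColumns <- exampleMatrix[, colGroups == i, drop=FALSE]
    finalMatrix[, i] <- rowSums(selectedColumns) == ncol(selectedColumns)
  }
})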

Upvotes: 3

Views: 263

Answers (2)

MichaelChirico

Reputation: 34703

From your example, I understand that there are very few columns and very many rows. In this case it's efficient to just do a simple loop over colGroups (about a 30% improvement over your version):

finalMatrix <- matrix(TRUE, nrow(exampleMatrix), max(colGroups))  # start from all-TRUE
for (jj in seq_along(colGroups))
  finalMatrix[ , colGroups[jj]] =
    finalMatrix[ , colGroups[jj]] & exampleMatrix[ , jj]

I think it will be hard to beat this without parallelizing. The loop is also parallelizable when there are more columns, though the parallelization would have to be done a bit carefully, in batches, so that no two workers update the same output column.
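
For concreteness, here is a sketch of that batching using the base parallel package (my addition, not part of the original answer; mclapply forks, so this is Unix-only, and the worker count is illustrative). Each batch owns one output column, so no two workers write to the same column:

library(parallel)
res <- mclapply(seq_len(max(colGroups)), function(i) {
  sel <- exampleMatrix[, colGroups == i, drop=FALSE]
  rowSums(sel) == ncol(sel)          # AND the columns in this group
}, mc.cores = 2)                     # illustrative core count
finalMatrix <- do.call(cbind, res)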

Upvotes: 4

thelatemail

Reputation: 93813

As far as I can tell, this is an aggregation across columns using the all function. So if you transpose to rows, apply all with colGroups as the grouping factor, and then transpose back to columns, you get the intended result:

t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])

#    [,1]  [,2]  [,3]
#V1  TRUE FALSE  TRUE
#V2  TRUE FALSE  TRUE
#V3 FALSE  TRUE FALSE
#V4  TRUE FALSE  TRUE
#V5  TRUE FALSE  TRUE
#V6 FALSE  TRUE FALSE

The [-1] just drops the group-identifier variable, which you don't require in the final output.
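
If you want a bare logical matrix without the V1..V6 dimnames, an optional cleanup step (my addition):

res <- t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
dimnames(res) <- NULL
res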

If you're working with stupid big data, the by-group aggregation could be done in data.table as well:

library(data.table)
t(as.data.table(t(exampleMatrix))[, lapply(.SD, all), keyby=colGroups][, -1])

Note the keyby rather than by: plain by returns groups in order of first appearance, while keyby sorts them, so the result's columns come out in the 1:n order the question asks for.
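
As a quick sanity check (a sketch, assuming finalMatrix from the question's loop is still in scope):

dtRes <- t(as.data.table(t(exampleMatrix))[, lapply(.SD, all), keyby=colGroups][, -1])
identical(unname(dtRes), unname(finalMatrix))
# [1] TRUE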

Upvotes: 2
