Walker in the City

Reputation: 587

Efficiently combine (AND) groups of columns in a logical matrix

I am looking for an efficient way to combine selected columns of a logical matrix by ANDing them together, producing a new matrix. An example of what I am looking for:

matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE  TRUE FALSE  TRUE
[2,]  TRUE FALSE  TRUE  TRUE
[3,] FALSE  TRUE  TRUE FALSE
[4,]  TRUE  TRUE FALSE  TRUE
[5,]  TRUE FALSE  TRUE  TRUE
[6,] FALSE  TRUE  TRUE FALSE

The columns to be ANDed together are specified in a numeric vector of length ncol(exampleMatrix): columns that should be grouped and ANDed share the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have its columns in order 1:n. For example, if the vector that specifies the column groups is

colGroups <- c(3, 2, 2, 1)

Then the resulting matrix would be

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE
[4,]  TRUE FALSE  TRUE
[5,]  TRUE FALSE  TRUE
[6,] FALSE  TRUE FALSE

Where in the resulting matrix

[,1] = exampleMatrix[,4] 
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
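
For illustration, the whole transformation can be expressed as a short base-R helper (andGroups is a hypothetical name, not something from the question):

andGroups <- function(m, groups) {
  # for each group label (in sorted order), AND that group's columns together
  sapply(sort(unique(groups)), function(g)
    Reduce(`&`, lapply(which(groups == g), function(j) m[, j])))
}
andGroups(exampleMatrix, colGroups)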

My current way of doing this looks basically like this:

finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=max(colGroups))
for (i in 1:max(colGroups)){
    selectedColumns <- exampleMatrix[,colGroups==i, drop=FALSE]
    # a group ANDs to TRUE exactly where the row sum equals the number of columns
    finalMatrix[,i] <- rowSums(selectedColumns)==ncol(selectedColumns)
}

Here rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
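
For instance, on the toy matrix above, the trick agrees with an explicit AND of group 2 (a quick sanity check, not from the original post):

g2 <- exampleMatrix[, colGroups == 2, drop=FALSE]
identical(rowSums(g2) == ncol(g2), g2[, 1] & g2[, 2])
# [1] TRUE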

My problem is that I am doing this on very big matrices (millions of rows), and I am looking for any way to make it quicker. My first instinct was to use apply in some way, but I can't see how that would improve efficiency: the loop body only runs a handful of times, and it is the operation inside the loop that is slow.

In addition, any tips to reduce memory allocation would be very useful, as I currently have to run gc() frequently within the loop to avoid running out of memory completely, and gc() is itself an expensive operation that slows everything down. Thanks!

For a more representative example, this is a much larger exampleMatrix:

matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)
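
With a matrix of this size, any candidate approach can be timed with a simple system.time() harness (my addition; exact timings will vary by machine):

system.time({
  finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=max(colGroups))
  for (i in 1:max(colGroups)) {
    selectedColumns <- exampleMatrix[, colGroups == i, drop=FALSE]
    finalMatrix[, i] <- rowSums(selectedColumns) == ncol(selectedColumns)
  }
})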

Upvotes: 3

Views: 263

Answers (2)

MichaelChirico

Reputation: 34703

From your example, I understand that there are very few columns and very many rows. In this case it's efficient to just do a simple loop over colGroups (about a 30% improvement over your version):

finalMatrix <- matrix(TRUE, nrow(exampleMatrix), max(colGroups))  # start from all-TRUE
for (jj in seq_along(colGroups))
  finalMatrix[ , colGroups[jj]] =
    finalMatrix[ , colGroups[jj]] & exampleMatrix[ , jj]

I think it will be hard to beat this without parallelizing. The loop is also parallelizable when there are more columns, though the parallelization would have to be done a bit carefully, in batches, so that no two workers update the same output column.
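
For concreteness, here is a sketch of that batching using the base parallel package (my addition, not part of the original answer; mclapply forks, so this is Unix-only, and the worker count is illustrative). Each batch owns one output column, so no two workers write to the same column:

library(parallel)
res <- mclapply(seq_len(max(colGroups)), function(i) {
  sel <- exampleMatrix[, colGroups == i, drop=FALSE]
  rowSums(sel) == ncol(sel)          # AND the columns in this group
}, mc.cores = 2)                     # illustrative core count
finalMatrix <- do.call(cbind, res)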

Upvotes: 4

thelatemail

Reputation: 93813

As far as I can tell, this is an aggregation across columns using the all function. So if you transpose to rows, apply all with colGroups as the grouping factor, and then transpose back to columns, you get the intended result:

t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])

#    [,1]  [,2]  [,3]
#V1  TRUE FALSE  TRUE
#V2  TRUE FALSE  TRUE
#V3 FALSE  TRUE FALSE
#V4  TRUE FALSE  TRUE
#V5  TRUE FALSE  TRUE
#V6 FALSE  TRUE FALSE

The [-1] just drops the group-identifier variable, which you don't require in the final output.
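
If you want a bare logical matrix without the V1..V6 dimnames, an optional cleanup step (my addition):

res <- t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
dimnames(res) <- NULL
res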

If you're working with stupid big data, the by-group aggregation could be done in data.table as well:

library(data.table)
t(as.data.table(t(exampleMatrix))[, lapply(.SD, all), keyby=colGroups][, -1])

Note the keyby rather than by: plain by returns groups in order of first appearance, while keyby sorts them, so the result's columns come out in the 1:n order the question asks for.
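
As a quick sanity check (a sketch, assuming finalMatrix from the question's loop is still in scope):

dtRes <- t(as.data.table(t(exampleMatrix))[, lapply(.SD, all), keyby=colGroups][, -1])
identical(unname(dtRes), unname(finalMatrix))
# [1] TRUE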

Upvotes: 2
