Reputation: 587
I am looking for an efficient way to combine selected columns of a logical matrix by ANDing them together, producing a new matrix. An example of what I am looking for:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE FALSE TRUE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE TRUE TRUE FALSE
[4,] TRUE TRUE FALSE TRUE
[5,] TRUE FALSE TRUE TRUE
[6,] FALSE TRUE TRUE FALSE
The columns to be ANDed together are specified in a numeric vector of length ncol(exampleMatrix), where columns that belong to the same group share the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have its columns ordered by group, 1:n. For example, if the vector that specifies the column groups is
colGroups <- c(3, 2, 2, 1)
Then the resulting matrix would be
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
[5,] TRUE FALSE TRUE
[6,] FALSE TRUE FALSE
Where in the resulting matrix
[,1] = exampleMatrix[,4]
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
My current way of doing this looks basically like this:
finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=3)
for (i in 1:3){
  selectedColumns <- exampleMatrix[,colGroups==i, drop=FALSE]
  finalMatrix[,i] <- rowSums(selectedColumns)==ncol(selectedColumns)
}
Here rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
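As a quick sanity check of that claim, the sketch below compares the rowSums trick with an explicit AND built from `&` on the small example above. The Reduce variant is only an illustrative alternative and not part of my actual code, and the names byRowSums and byReduce are made up for this example.

```r
exampleMatrix <- matrix(rep(c(TRUE, TRUE, FALSE), 8), nrow = 6, ncol = 4, byrow = TRUE)
colGroups <- c(3, 2, 2, 1)

# Columns belonging to group 2 (columns 2 and 3 of exampleMatrix)
selectedColumns <- exampleMatrix[, colGroups == 2, drop = FALSE]

# AND via counting: a row is all TRUE exactly when its TRUE count equals the column count
byRowSums <- rowSums(selectedColumns) == ncol(selectedColumns)

# AND via explicit pairwise `&` over the columns (illustrative alternative)
byReduce <- Reduce(`&`, lapply(seq_len(ncol(selectedColumns)),
                               function(j) selectedColumns[, j]))

# Compare the two results
identical(byRowSums, byReduce)
```

Note that the two are only interchangeable when the matrix has no NA values: rowSums propagates NA by default, while FALSE & NA is FALSE.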
My problem is that I am doing this on very big matrices (millions of rows), and I am looking for any way to make it quicker. My first instinct was to use apply in some way, but I can't see how that would improve efficiency: the for loop only runs a few times, and it is the operation inside the loop that is slow.
In addition, any tips to reduce memory allocation would be very useful. I currently have to run gc() frequently within the loop to avoid running out of memory completely, and it is a very expensive operation that significantly slows everything down. Thanks!
For a more representative example, this is a much larger exampleMatrix:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)
Upvotes: 3
Views: 263
Reputation: 34703
From your example, I understand that there are very few columns and very many rows. In this case, it'll be efficient to just do a simple loop over colGroups (a 30% improvement over your suggestion), with finalMatrix initialized to all TRUE as in your code:
for (jj in seq_along(colGroups))
  finalMatrix[ , colGroups[jj]] =
    finalMatrix[ , colGroups[jj]] & exampleMatrix[ , jj]
I think it will be hard to beat this without parallelizing. The loop is parallelizable if there are more columns, though the parallelization would have to be done carefully, in batches.
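For completeness, here is a self-contained sketch of the loop on the small example from the question. It assumes finalMatrix starts out as all TRUE (as in the question's own setup), since TRUE is the identity for AND. Sizing the result with max(colGroups) is an assumption added here to avoid hard-coding 3.

```r
exampleMatrix <- matrix(rep(c(TRUE, TRUE, FALSE), 8), nrow = 6, ncol = 4, byrow = TRUE)
colGroups <- c(3, 2, 2, 1)

# Start from all TRUE so that the first `&` simply copies the column in
# (ncol = max(colGroups) is an assumption; the question hard-codes 3)
finalMatrix <- matrix(TRUE, nrow = nrow(exampleMatrix), ncol = max(colGroups))

# One pass over the input columns; each is ANDed into its group's output column
for (jj in seq_along(colGroups)) {
  finalMatrix[, colGroups[jj]] <- finalMatrix[, colGroups[jj]] & exampleMatrix[, jj]
}

finalMatrix
```

Each iteration touches one input column and one output column at a time, so it never builds a multi-column submatrix per group, which may also help with the memory pressure you describe.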
Upvotes: 4
Reputation: 93813
As far as I can tell, this is an aggregation across columns using the all function. So if you transpose (t) to rows, use colGroups as the grouping factor to apply all, then transpose (t) back to columns, you should get the intended result:
t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
# [,1] [,2] [,3]
#V1 TRUE FALSE TRUE
#V2 TRUE FALSE TRUE
#V3 FALSE TRUE FALSE
#V4 TRUE FALSE TRUE
#V5 TRUE FALSE TRUE
#V6 FALSE TRUE FALSE
The [-1] just drops the group-identifier variable, which you don't require in the final output.
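To see that this matches the column-wise definition in the question, the sketch below compares the aggregate result with the three columns written out by hand. unname() is used only to drop the V1 to V6 row names before comparing, and the names viaAggregate and byHand are illustrative.

```r
exampleMatrix <- matrix(rep(c(TRUE, TRUE, FALSE), 8), nrow = 6, ncol = 4, byrow = TRUE)
colGroups <- c(3, 2, 2, 1)

# Transpose, aggregate the rows by group with all(), drop the group column, transpose back
viaAggregate <- t(aggregate(t(exampleMatrix), list(colGroups), FUN = all)[-1])

# The expected columns, taken directly from the question's definition
byHand <- cbind(exampleMatrix[, 4],
                exampleMatrix[, 2] & exampleMatrix[, 3],
                exampleMatrix[, 1])

# Compare element by element, ignoring row and column names
all(unname(viaAggregate) == byHand)
```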
If you're working with stupid big data, the by-group aggregation could be done in data.table as well:
library(data.table)
t(as.data.table(t(exampleMatrix))[, lapply(.SD,all), by=colGroups][,-1])
Upvotes: 2