Reputation: 282
I need to count the number of occurrences of each value in the entire matrix.
My matrix consists only of "0", "1" and "2", and I need the result to be the total number of occurrences of each of these values.
If this is my matrix:
The result should be:
0 -> 10
1 -> 9
2 -> 11
I am looking for an answer that uses apply. I feel like I've searched half of the internet, but I haven't stumbled across an answer that is understandable or easy to reproduce.
I ultimately want to make this apply work in parallel, but I think I know how to do that part.
I should mention that my matrix is 30e6 x 18, and table is not working, I guess because of a memory issue.
Upvotes: 1
Views: 3234
Reputation: 11336
If your matrix is large, then it makes sense to compute counts rowwise or columnwise to conserve memory. apply is a valid way to go about this.
Conceptually, this answer is not unlike the one I provided here for data frames. I will once again recommend that you use tabulate instead of table; it is really much more efficient.
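To see what tabulate is doing here, a minimal sketch on a small character vector: factor maps the values onto integer codes, and tabulate counts those codes directly instead of building a full contingency table.
x <- c("0", "2", "2", "1", "0")
table(x)                                             # named contingency table
## x
## 0 1 2
## 2 1 2
tabulate(factor(x, levels = c("0", "1", "2")), 3L)   # bare integer counts
## [1] 2 1 2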
set.seed(1L)
m <- 5L
n <- 4L
A <- matrix(sample(c("0", "1", "2"), size = m * n, replace = TRUE), m, n)
A
[,1] [,2] [,3] [,4]
[1,] "0" "2" "2" "1"
[2,] "2" "2" "0" "1"
[3,] "0" "1" "0" "1"
[4,] "1" "1" "0" "2"
[5,] "0" "2" "1" "0"
f <- function(x, levels) tabulate(factor(x, levels), length(levels))
rowSums(apply(A, 1L, f, c("0", "1", "2"))) # if 'A' has more columns than rows
## [1] 7 7 6
rowSums(apply(A, 2L, f, c("0", "1", "2"))) # if 'A' has more rows than columns
## [1] 7 7 6
You are going to want apply to loop over the smaller dimension of your matrix, so choose the second argument accordingly. If your matrix actually has millions of rows and only 18 columns, then use the second statement above, not the first.
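If you would rather not hard-code that choice, a small sketch that picks the margin from the dimensions of A (reusing f and the levels from above):
levs <- c("0", "1", "2")
margin <- if (nrow(A) >= ncol(A)) 2L else 1L  # loop over the smaller dimension
setNames(rowSums(apply(A, margin, f, levs)), levs)
## 0 1 2
## 7 7 6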
Here is a test using a matrix with your dimensions. It only takes ~10 seconds on my machine, so parallelization might be overkill.
set.seed(1L)
m <- 3e+07L
n <- 18L
A <- matrix(sample(c("0", "1", "2"), m * n, replace = TRUE), m, n)
system.time(rowSums(apply(A, 2L, f, c("0", "1", "2"))))
## user system elapsed
## 8.195 2.816 12.322
Just for fun:
library("parallel")
system.time(Reduce(`+`, mclapply(seq_len(n), function(i) f(A[, i], c("0", "1", "2")), mc.cores = 4L)))
## user system elapsed
## 3.924 0.904 3.497
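Note that mclapply relies on forking, so on Windows mc.cores > 1 is effectively ignored. A rough PSOCK-cluster sketch of the same idea (it copies A to every worker, which may be costly for a matrix this size):
library("parallel")
cl <- makeCluster(4L)
clusterExport(cl, c("A", "f"))  # ship the matrix and the counting function to the workers
res <- Reduce(`+`, parLapply(cl, seq_len(ncol(A)),
                             function(i) f(A[, i], c("0", "1", "2"))))
stopCluster(cl)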
Upvotes: 3
Reputation: 887741
The usual way would be to unlist (if it is a data.frame) or to concatenate with c (if it is a matrix, to convert it to a vector) and then apply table.
table(unlist(m1))
0 1 2
10 9 11
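For a matrix specifically, flattening with c (or as.vector) before calling table gives the same counts, for example:
table(c(m1))
 0  1  2
10  9 11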
If the dataset is really big, we can loop. Here, the number of columns seems to be 30e6 and the number of rows 18. We may loop over the rows with a for loop and update a temporary counts vector:
out <- setNames(rep(0, 3), 0:2)
for (i in seq_len(nrow(m1))) {
  tmp <- table(m1[i, ])
  out[names(tmp)] <- out[names(tmp)] + tmp
}
out
0 1 2
10 9 11
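A variant of the same accumulation idea that uses tabulate on blocks of columns, so each iteration only materialises a slice of the matrix; a sketch with an arbitrary block size of 1e6 columns:
block <- 1e6L
out <- setNames(rep(0, 3), 0:2)
for (s in seq(1L, ncol(m1), by = block)) {
  idx <- s:min(s + block - 1L, ncol(m1))                 # columns in this block
  out <- out + tabulate(factor(m1[, idx], levels = 0:2), 3L)
}
out
 0  1  2
10  9 11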
If we need a parallel option, we can use dapply with parallel = TRUE, and do a colSums on the output generated:
library(collapse)
colSums(dapply(m1, MARGIN = 1,
               FUN = \(x) table(factor(x, levels = 0:2)),
               parallel = TRUE))
X1 X2 X3
10 9 11
m1 <- structure(c(2, 1, 1, 2, 2, 0, 1, 1, 2, 1, 1, 0, 0, 2, 0, 2, 0,
1, 2, 2, 0, 0, 0, 2, 0, 1, 2, 2, 0, 1), .Dim = c(10L, 3L), .Dimnames = list(
NULL, c("X1", "X2", "X3")))
Upvotes: 3