Reputation: 9809
I have a matrix with a grouping column and a column to sum up values.
I want to split the matrix into groups, sum up some values (by keeping the same length of vectors), unsplit it and assign them to a new column?
What is the most performant and base-R-onic way of doing that?
The winner for now is an lapply
function, but I am wondering if there is another function I am missing? Something like stats::aggregate
which keeps the same structure?
I would like to stay in base R and keep matrices, so no dplyr
or data.table
;).
Edit1: I included the aggregate + merge and the sapply solution powered by @IceCreamToucan. (Thanks for that). Its not a very fair comparison for aggregate
as I first convert to data.frames and then back to matrices.
Edit2: With bigger matrices and 100 groups ave
outperforms the other functions. Thanks to @Gregor for this one.
set.seed(104)
smpl = sample(1:100, size = 10000, T)
mat0 <- as.matrix(data.frame(
group=smpl,
sum=sample(seq(0,100,10), length(smpl), T)
))
mat1 <- cbind(mat0, "sums"=0)
library(microbenchmark)
check <- function(values) {
all(sapply(values[-1], function(x) all.equal(values[[1]], x)))}
mf = microbenchmark(#check = check,
forloop = {
mat <- mat1
for (z in unique(mat[,'group'])) {
mat[mat[,'group'] == z,'sums'] = sum(mat[mat[,'group'] == z,'sum'])
}
mat
},
lapply = {
mat <- mat1
mat[,'sums'] <- unlist(lapply(unique(mat[,'group']), function(i) {
sums = sum(mat[mat[,'group'] == i,'sum'])
rep(sums, length(mat[mat[,'group'] == i,'sum']))
}))
mat
},
sapply = {
mat <- mat1
mat <- mat[order(mat[,'group']),]
mat[,'sums'] <- rep(sapply(split(mat[, 'sum'], mat[, 'group']), sum),
table(mat[, 'group']))
mat
},
ave = {
mat <- mat1
mat[,'sums'] <- ave(x = mat[, 'sum'], mat[, 'group'], FUN = sum)
mat[order(mat[,'group']),]
},
aggregate = {
matA <- mat0
matA <- matA[order(matA[,'group']),]
res = aggregate(sum ~ group, FUN = sum, data = matA)
matdf = data.frame(matA)
base::merge(res, matdf, by ="group")
}
)
mf
Unit: milliseconds expr min lq mean median uq max neval cld forloop 19.94083 25.73131 25.95823 25.97898 26.58043 38.68300 100 bc lapply 15.96057 21.44226 24.23693 21.88130 22.41287 311.00252 100 bc sapply 21.89081 22.41981 23.42291 22.70492 23.04978 37.41853 100 b ave 11.79256 12.08868 12.51119 12.27613 12.52803 18.20577 100 a aggregate 26.54753 27.31484 29.09592 27.71163 28.71937 54.75284 100 c
Upvotes: 1
Views: 285
Reputation: 145965
Consulting various R-FAQ (how to sum by group?, Grouping functions and the *apply family), the base R function for the purpose of summing by group without aggregation is ave
:
ave(x = mat1[, 'sum'], mat1[, 'group'], FUN = sum)
As edited into the question, ave
is quite fast when there are lots of groups.
Upvotes: 1