Reputation: 827
I need to take column sums over a large range of select columns. For example:
library(data.table)
set.seed(123)
DT = data.table(grp = c("A", "B", "C"),
x1 = sample(1:10, 3),
x2 = sample(1:10, 3),
x3 = sample(1:10, 3),
x4 = sample(1:10, 3))
> DT
grp x1 x2 x3 x4
1: A 3 9 6 5
2: B 8 10 9 9
3: C 4 1 5 4
Say, I want to sum over x2
and x3
. I would normally do this using:
> DT[, .(total = sum(x2, x3)), by=grp]
grp total
1: A 15
2: B 19
3: C 6
However, if the range of columns is very large, say 100, how can this be coded elegantly, without spelling each column by name?
What I tried (and what didn't work):
my_cols <- paste0("x", 2:3)
DT[, .(total = sum(get(my_cols))), by=grp]
grp total
1: A 9
2: B 10
3: C 1
Appears to use only the first column (x2
) and disregard the rest.
Upvotes: 1
Views: 197
Reputation: 92292
I didn't find an exact dupe (that deals with sum by row by group) so here 5 different possibilities I could think off.
The main thing to remember here that you are working with a data.table per group, hence, some functions won't work without unlist
## Create an example data
library(data.table)
set.seed(123)
DT <- data.table(grp = c("A", "B", "C"),
matrix(sample(1:10, 30 * 4, replace = TRUE), ncol = 4))
my_cols <- paste0("V", 2:3)
## 1- This won't work with `NA`s. It will work without `unlist`,
## but won't return correct results.
DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp]
## 2 - Convert to long format first and then aggregate
melt(DT, "grp", measure = my_cols)[, sum(value), by = grp]
## 3 - Using `base::sum` which can handle data.frames,
## see `?S4groupGeneric` (a data.table is also a data.frame)
DT[, base::sum(.SD), .SDcols = my_cols, by = grp]
## 4 - This will use data.tables enhanced `gsum` function,
## but it can't handle data.frames/data.tables
## Hence, requires unlist first. Will be interesting to measure the tradeoff
DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp]
## 5 - This is a modification to your original attempt that both handles multiple columns
## (`mget` instead of `get`) and adds `unlist`
## (no point trying wuth `base::sum` instead, because it will also require `unlist`)
DT[, sum(unlist(mget(my_cols))), by = grp]
All of these will return the same result
# grp V1
# 1: A 115
# 2: B 105
# 3: C 96
Some benchmarks
library(data.table)
library(microbenchmark)
library(stringi)
set.seed(123)
N <- 1e5
cols <- 50
DT <- data.table(grp = stri_rand_strings(N / 1e4, 2),
matrix(sample(1:10, N * cols, replace = TRUE),
ncol = cols))
my_cols <- paste0("V", 1:20)
mbench <- microbenchmark(
"Reduce/unlist: " = DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp],
"melt: " = melt(DT, "grp", measure = my_cols)[, sum(value), by = grp],
"base::sum: " = DT[, base::sum(.SD), .SDcols = my_cols, by = grp],
"gsum/unlist: " = DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp],
"gsum/mget/unlist: " = DT[, sum(unlist(mget(my_cols))), by = grp]
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Reduce/unlist: 1968.93628 2185.45706 2332.66770 2301.10293 2440.43138 3161.15522 100 c
# melt: 33.91844 58.18254 66.70419 64.52190 74.29494 132.62978 100 a
# base::sum: 18.00297 22.44860 27.21083 25.14174 29.20080 77.62018 100 a
# gsum/unlist: 780.53878 852.16508 929.65818 894.73892 968.28680 1430.91928 100 b
# gsum/mget/unlist: 797.99854 876.09773 963.70562 928.27375 1003.04632 1578.76408 100 b
library(ggplot2)
autoplot(mbench)
Upvotes: 3