Reputation: 4899
I am trying to apply multiple functions to multiple columns of a data.table
. Example:
DT <- data.table("a"=1:5,
"b"=2:6,
"c"=3:7)
Let's say I want to get the mean and the median of columns a
and b
.
This works:
stats <- DT[,.(mean_a=mean(a),
median_a=median(a),
mean_b=mean(b),
median_b=median(b))]
But it is way too repetitive. Is there a nice way to achieve a similar result using .SDcols
and lapply
?
Upvotes: 39
Views: 16134
Reputation: 10437
Other answers show how to do it, but no one bothered to explain the basic principle. The basic rule is that elements of lists returned by j
expressions form the columns of the resulting data.table
. Any j
expression that produces a list, each element of which corresponds to a desired column in the result, will work. With this in mind we can use
DT[, c(mean = lapply(.SD, mean),
median = lapply(.SD, median)),
.SDcols = c('a', 'b')]
## mean.a mean.b median.a median.b
## 1: 3 4 3 4
or
DT[, unlist(lapply(.SD,
function(x) list(mean = mean(x),
median = median(x))),
recursive = FALSE),
.SDcols = c('a', 'b')]
## a.mean a.median b.mean b.median
## 1: 3 3 4 4
depending on the desired order.
Importantly we can use any method we want to produce the desired result, provided only that we arrange the result into a list as described above. For example,
library(matrixStats)
DT[, c(mean = as.list(colMeans(.SD)),
median = setNames(as.list(colMedians(as.matrix(.SD))), names(.SD))),
.SDcols = c('a', 'b')]
## mean.a mean.b median.a median.b
## 1: 3 4 3 4
also works.
Upvotes: 16
Reputation: 1938
This might be a little over-engineered, but if you come from dplyr's summarize_at()
you might want to have a similar structured result.
First define a function lapply_at()
which takes a .SD
and a character vector of function names as inputs. Then you can easily compute your desired statistics and get a readable result.
library(data.table)
iris_dt <- as.data.table(iris)
lapply_at <- function(var, funs, ...) {
results <- sapply(var, function(var) {
lapply(funs, do.call, list(var, ...))
})
names(results) <- vapply(names(var), paste, funs, sep = "_",
FUN.VALUE = character(length(funs)),
USE.NAMES = FALSE)
results
}
iris_dt[, lapply_at(.SD, c("mean", "sd"), na.rm = TRUE),
.SDcols = patterns("^Sepal"),
by = Species]
#> Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean
#> 1: setosa 5.006 0.3524897 3.428
#> 2: versicolor 5.936 0.5161711 2.770
#> 3: virginica 6.588 0.6358796 2.974
#> Sepal.Width_sd
#> 1: 0.3790644
#> 2: 0.3137983
#> 3: 0.3224966
Created on 2019-07-03 by the reprex package (v0.2.0).
Upvotes: 3
Reputation: 31
use dcast
DT$dday <- 1 # add a constant column
dt <- dcast(DT, dday~dday, fun=list(sum, mean), value.var = c('a', 'b'))
# dday a_sum_1 b_sum_1 a_mean_1 b_mean_1
# 1 15 20 3 4
In fact, we can use dcast to implement onehot and feature engineer.
Upvotes: 3
Reputation: 49448
I'd normally do this:
my.summary = function(x) list(mean = mean(x), median = median(x))
DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median b.mean b.median
# 3 3 4 4
Upvotes: 39
Reputation: 31181
This is a little bit clumsy but does the job with data.table
:
funcs = c('median', 'mean', 'sum')
m = DT[, lapply(.SD, function(u){
sapply(funcs, function(f) do.call(f,list(u)))
})][, t(.SD)]
colnames(m) = funcs
# median mean sum
#a 3 3 15
#b 4 4 20
#c 5 5 25
Upvotes: 12