paljenczy
paljenczy

Reputation: 4899

Apply multiple functions to multiple columns in data.table

I am trying to apply multiple functions to multiple columns of a data.table. Example:

DT <- data.table("a"=1:5,
                 "b"=2:6,
                 "c"=3:7)

Let's say I want to get the mean and the median of columns a and b. This works:

stats <- DT[,.(mean_a=mean(a),
               median_a=median(a),
               mean_b=mean(b),
               median_b=median(b))]

But it is way too repetitive. Is there a nice way to achieve a similar result using .SDcols and lapply?

Upvotes: 39

Views: 16134

Answers (5)

Ista
Ista

Reputation: 10437

Other answers show how to do it, but no one bothered to explain the basic principle. The basic rule is that elements of lists returned by j expressions form the columns of the resulting data.table. Any j expression that produces a list, each element of which corresponds to a desired column in the result, will work. With this in mind we can use

DT[, c(mean = lapply(.SD, mean),
       median = lapply(.SD, median)),
  .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

or

DT[, unlist(lapply(.SD,
                   function(x) list(mean = mean(x),
                                    median = median(x))),
            recursive = FALSE),
   .SDcols = c('a', 'b')]
##    a.mean a.median b.mean b.median
## 1:      3        3      4        4

depending on the desired order.

Importantly we can use any method we want to produce the desired result, provided only that we arrange the result into a list as described above. For example,

library(matrixStats)
DT[, c(mean = as.list(colMeans(.SD)),
       median = setNames(as.list(colMedians(as.matrix(.SD))), names(.SD))),
   .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

also works.

Upvotes: 16

der_grund
der_grund

Reputation: 1938

This might be a little over-engineered, but if you come from dplyr's summarize_at() you might want to have a similar structured result.

First define a function lapply_at() which takes a .SD and a character vector of function names as inputs. Then you can easily compute your desired statistics and get a readable result.

library(data.table)
iris_dt <- as.data.table(iris)

lapply_at <- function(var, funs, ...) {
  results <- sapply(var, function(var) {
    lapply(funs, do.call, list(var, ...))
  })
  names(results) <- vapply(names(var), paste, funs, sep = "_", 
                           FUN.VALUE = character(length(funs)),
                           USE.NAMES = FALSE)
  results
}

iris_dt[, lapply_at(.SD, c("mean", "sd"), na.rm = TRUE), 
        .SDcols = patterns("^Sepal"),
        by = Species]

#>       Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean
#> 1:     setosa             5.006       0.3524897            3.428
#> 2: versicolor             5.936       0.5161711            2.770
#> 3:  virginica             6.588       0.6358796            2.974
#>    Sepal.Width_sd
#> 1:      0.3790644
#> 2:      0.3137983
#> 3:      0.3224966

Created on 2019-07-03 by the reprex package (v0.2.0).

Upvotes: 3

Hilbert sun
Hilbert sun

Reputation: 31

use dcast

DT$dday <- 1 # add a constant column
dt <- dcast(DT, dday~dday, fun=list(sum, mean), value.var = c('a', 'b'))
# dday a_sum_1 b_sum_1 a_mean_1 b_mean_1
# 1      15      20        3        4

In fact, we can use dcast to implement onehot and feature engineer.

Upvotes: 3

eddi
eddi

Reputation: 49448

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median   b.mean b.median 
#     3        3        4        4 

Upvotes: 39

Colonel Beauvel
Colonel Beauvel

Reputation: 31181

This is a little bit clumsy but does the job with data.table:

funcs = c('median', 'mean', 'sum')

m = DT[, lapply(.SD, function(u){
        sapply(funcs, function(f) do.call(f,list(u)))
     })][, t(.SD)]
colnames(m) = funcs

#  median mean sum
#a      3    3  15
#b      4    4  20
#c      5    5  25

Upvotes: 12

Related Questions