dothatrumba
dothatrumba

Reputation: 67

Return multiple columns within a data.table while aggregating using .()/list and by

In the example below, I have a function (meansd) that returns the mean of a vector and optionally returns the standard deviation as well. When combining [.data.table, the ./list syntax in j and by the ideal result that is long on by, but wide on metric (e.g. the mean and sd returned by the function) as shown in the dcast result.

Is it possible to get the result of the dcast call directly from a call to [.data.table similar to a[, .(meansd(var1, T), meansd(var1, T)), by = bvar] (and therefore skip the dcast step)? Or does the error from the last example (combining transpose and by) mean I'm out of luck?

library('data.table')

a = data.table(var1 = c(1,2,3,10,123,12,31,4,6,2), bvar = 1:2)

meansd = function(x, sd = TRUE){
  if(!sd) return(mean(x))
  return(list(mean = mean(x), sd = sd(x)))
  
}

#Presumable syntax:
a[, .(meansd(var1, T), meansd(var1, T)), by = bvar] #data are long on metric which is not ideal
#>    bvar       V1       V2
#> 1:    1     32.8     32.8
#> 2:    1  51.8575  51.8575
#> 3:    2        6        6
#> 4:    2 4.690416 4.690416

#What I was hoping the output would be (dcast call results):
ideal = a[, .(meansd(var1, T), meansd(var1, T)), by = bvar]
ideal[, metric := rep(c('mean', 'sd'),2)]
#ideally the output would look something like this, without the need for the dcast:
dcast(ideal,bvar~metric, value.var = c('V1','V2')) #names are nice, but unimportant
#>    bvar V1_mean    V1_sd V2_mean    V2_sd
#> 1:    1    32.8  51.8575    32.8  51.8575
#> 2:    2       6 4.690416       6 4.690416

Some related things I've tried:

#some success when only operating on one column:
a[, .(meansd(var1, F))] #returned vector becomes a data.table per `.()` syntax
#>      V1
#> 1: 19.4

a[, .(meansd(var1, T))] #why is this long? Is it possible to be wide? Why does the inclusion of `.()` change it?
#>          V1
#> 1:     19.4
#> 2: 37.47651

a[, .(t(meansd(var1, T)))] #transpose works alright
#>    mean       sd
#> 1: 19.4 37.47651

a[, .(t(meansd(var1, T))), by = bvar] #by does not like a transpose
#> Error in `[.data.table`(a, , .(t(meansd(var1, T))), by = bvar): All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.

Upvotes: 0

Views: 49

Answers (1)

Vincent
Vincent

Reputation: 17715

If you concatenate your two calls to meansd into a single vector and then transpose it, you'll get 4 separate columns.

Then you just need to clean up by renaming your columns:

library('data.table')

a = data.table(var1 = c(1,2,3,10,123,12,31,4,6,2), bvar = 1:2)

meansd = function(x, sd = TRUE){
  if(!sd) return(mean(x))
  return(list(mean = mean(x), sd = sd(x)))
}

a = a[, t(c(meansd(var1, TRUE), meansd(var1, TRUE))), by=bvar]
colnames(a) = c("bvar", "V1_mean", "V1_sd", "V2_mean", "V2_sd")

a
#>    bvar V1_mean     V1_sd V2_mean     V2_sd
#> 1:    1    32.8 51.857497    32.8 51.857497
#> 2:    2     6.0  4.690416     6.0  4.690416

Upvotes: 1

Related Questions