Aggregate calculations with and without grouping variable in data.table

Question

I'm producing some summary statistics at the by-group and overall levels.

(Note: the overall statistic cannot necessarily be derived from the group-level statistics. The overall mean can be recovered as a weighted average of the group means, but the overall median cannot be recovered from the group medians.)

Thus far my workarounds use rbindlist on either summary stats or copies of the original data, as in:

library(data.table)
data(iris)

d <- data.table(iris)

# Approach 1)

rbindlist(list(d[, lapply(.SD, median),  by=Species, .SDcols=c('Sepal.Length','Petal.Length')],
               d[, lapply(.SD, median),  .SDcols=c('Sepal.Length', 'Petal.Length')]),
      fill=TRUE)
#       Species Sepal.Length Petal.Length
# 1:     setosa          5.0         1.50
# 2: versicolor          5.9         4.35
# 3:  virginica          6.5         5.55
# 4:         NA          5.8         4.35

# Approach 2)

# Copy d before using :=, which otherwise overwrites Species in d itself by reference
d2 <- rbindlist(list(d, copy(d)[, Species := "Overall"]))
d2[, lapply(.SD, median),  by=Species, .SDcols=c('Sepal.Length', 'Petal.Length')]
#       Species Sepal.Length Petal.Length
# 1:     setosa          5.0         1.50
# 2: versicolor          5.9         4.35
# 3:  virginica          6.5         5.55
# 4:    Overall          5.8         4.35

The first approach seems to be faster, since it avoids copying the data.

The second approach lets me use the label "Overall" instead of the NA fill, which is clearer if some records are missing the Species value (in the first approach, that would produce two rows with NA Species).
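
To make the NA collision concrete, here is a minimal sketch of the first approach on data with missing labels. The table d_na and the five blanked-out Species values are made up for the illustration; they are not part of my real data.

```r
library(data.table)

cols <- c("Sepal.Length", "Petal.Length")

# Hypothetical data: blank out the Species label on the first five rows
ir <- iris
ir$Species[1:5] <- NA
d_na <- as.data.table(ir)

# Approach 1: by-group medians stacked on top of the overall medians.
# fill = TRUE pads the missing Species column of the overall row with NA.
res <- rbindlist(list(d_na[, lapply(.SD, median), by = Species, .SDcols = cols],
                      d_na[, lapply(.SD, median), .SDcols = cols]),
                 fill = TRUE)

# One NA row is the group of unlabeled records, the other is the overall row
nrow(res)                 # 5
sum(is.na(res$Species))   # 2
```

Nothing in the result says which of the two NA rows is the overall one, other than its position at the end.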

Are there any other solutions I should consider?

eddi · Accepted Answer

I think I normally do it like this:

cols = c('Sepal.Length','Petal.Length')

rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
      d[, lapply(.SD, median), .SDcols=cols][, Species := 'Overall'])
#      Species Sepal.Length Petal.Length
#1:     setosa          5.0         1.50
#2: versicolor          5.9         4.35
#3:  virginica          6.5         5.55
#4:    Overall          5.8         4.35
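
Another option that may be worth a look, assuming a data.table version recent enough to ship rollup(), is to let it append the grand-total row. This is only a sketch and not what I normally use. With id = TRUE it adds a grouping column that marks the total row, so the relabeling does not depend on is.na(Species):

```r
library(data.table)

d <- data.table(iris)
cols <- c("Sepal.Length", "Petal.Length")

# rollup() aggregates by Species and then appends a grand-total row.
# id = TRUE adds a `grouping` column: 0 for by-group rows, 1 for the total row.
res <- rollup(d, j = lapply(.SD, median), by = "Species",
              .SDcols = cols, id = TRUE)

# Relabel the total row via the marker column rather than via NA
res[, Species := as.character(Species)]
res[grouping == 1L, Species := "Overall"]
res[, grouping := NULL]
res
```

Because the total row is identified by grouping, records with a genuinely missing Species keep their own NA row and are not confused with the overall one.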
