C8H10N4O2
C8H10N4O2

Reputation: 18995

Aggregate calculations with and without grouping variable in data.table

I'm producing some summary statistics at the by-group and overall levels.

(Note: the overall statistic cannot necessarily be derived from the group-level stats. A weighted average could work, but not a median.)

Thus far my workarounds use rbindlist on either summary stats or copies of the original data, as in:

library(data.table)
data(iris)

d <- data.table(iris)

# Approach 1)

rbindlist(list(d[, lapply(.SD, median),  by=Species, .SDcols=c('Sepal.Length','Petal.Length')],
               d[, lapply(.SD, median),  .SDcols=c('Sepal.Length', 'Petal.Length')]),
      fill=TRUE)
#       Species Sepal.Length Petal.Length
# 1:     setosa          5.0         1.50
# 2: versicolor          5.9         4.35
# 3:  virginica          6.5         5.55
# 4:         NA          5.8         4.35

# Approach 2)

d2 <- rbindlist(list(copy(d), copy(d[,Species:="Overall"]) ) )
d2[, lapply(.SD, median),  by=Species, .SDcols=c('Sepal.Length', 'Petal.Length')]
#       Species Sepal.Length Petal.Length
# 1:     setosa          5.0         1.50
# 2: versicolor          5.9         4.35
# 3:  virginica          6.5         5.55
# 4:    Overall          5.8         4.35

The first approach seems to be faster (avoids copies).

The second approach allows me to use a label "Overall" instead of the NA fill, which is more intelligible if some records were missing the "Species" value (which in the first approach would result in two rows of NA Species.)

Are there any other solutions I should consider?

Upvotes: 2

Views: 248

Answers (2)

C8H10N4O2
C8H10N4O2

Reputation: 18995

I accepted @Eddi's answer but wanted to incorporate the good comment from @Frank. This approach IMO makes the most sense.

library(data.table)
d <- data.table(iris)
cols = c('Sepal.Length','Petal.Length')

rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
      d[, c(Species = 'Overall', lapply(.SD, median) ), .SDcols=cols])
#       Species Sepal.Length Petal.Length
# 1:     setosa          5.0         1.50
# 2: versicolor          5.9         4.35
# 3:  virginica          6.5         5.55
# 4:    Overall          5.8         4.35

It may also be slightly faster (1.54 vs. 1.73 millis on microbenchmark) than applying the secondary calculation.

Upvotes: 2

eddi
eddi

Reputation: 49448

I think I normally do it like this:

cols = c('Sepal.Length','Petal.Length')

rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
      d[, lapply(.SD, median), .SDcols=cols][, Species := 'Overall'])
#      Species Sepal.Length Petal.Length
#1:     setosa          5.0         1.50
#2: versicolor          5.9         4.35
#3:  virginica          6.5         5.55
#4:    Overall          5.8         4.35

Upvotes: 3

Related Questions