Reputation: 18995
I'm producing some summary statistics at the by-group and overall levels.
(Note: the overall statistic cannot necessarily be derived from the group-level stats. A weighted average could work, but not a median.)
Thus far my workarounds use rbindlist
on either summary stats or copies of the original data, as in:
library(data.table)
data(iris)
d <- data.table(iris)
# Approach 1)
rbindlist(list(d[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length','Petal.Length')],
d[, lapply(.SD, median), .SDcols=c('Sepal.Length', 'Petal.Length')]),
fill=TRUE)
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: NA 5.8 4.35
# Approach 2)
d2 <- rbindlist(list(copy(d), copy(d[,Species:="Overall"]) ) )
d2[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length', 'Petal.Length')]
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
The first approach seems to be faster (avoids copies).
The second approach allows me to use a label "Overall" instead of the NA
fill, which is more intelligible if some records were missing the "Species" value (which in the first approach would result in two rows of NA
Species.)
Are there any other solutions I should consider?
Upvotes: 2
Views: 248
Reputation: 18995
I accepted @Eddi's answer but wanted to incorporate the good comment from @Frank. This approach IMO makes the most sense.
library(data.table)
d <- data.table(iris)
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
d[, c(Species = 'Overall', lapply(.SD, median) ), .SDcols=cols])
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
It may also be slightly faster (1.54 vs. 1.73 millis on microbenchmark
) than applying the secondary calculation.
Upvotes: 2
Reputation: 49448
I think I normally do it like this:
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
d[, lapply(.SD, median), .SDcols=cols][, Species := 'Overall'])
# Species Sepal.Length Petal.Length
#1: setosa 5.0 1.50
#2: versicolor 5.9 4.35
#3: virginica 6.5 5.55
#4: Overall 5.8 4.35
Upvotes: 3