Reputation: 1111
I'm trying to get multiple summary statistics in R/S-PLUS grouped by categorical column in one shot. I found couple of functions, but all of them do one statistic per call, like aggregate()
.
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)
mg <- aggregate(df$dt, by=df$group, FUN=mean)
mg <- aggregate(df$dt, by=df$group, FUN=sum)
What I'm looking for is to get multiple statistics for the same group like mean, min, max, std, ...etc in one call, is that doable?
Upvotes: 111
Views: 336507
Reputation: 52349
collapse
offers a very flexible function for summary statistics with qsu
:
library(collapse)
with(df, qsu(dt, g = group))
# N Mean SD Min Max
# A 4 61 1.8257 59 63
# B 6 66 2.8284 63 71
# C 6 68 1.6733 66 71
# D 8 61 2.6186 56 64
It's also very fast:
microbenchmark::microbenchmark(
tapply = tapply(df$dt, df$group, summary),
dt = setDT(df)[, as.list(summary(dt)), by = group],
collapse = qsu(df$dt, g = df$group),
purrr = df %>% split(.$group) %>% purrr::map(summary)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# tapply 453.2 503.75 531.718 522.70 548.6 946.8 100
# dt 998.8 1076.90 1288.057 1127.55 1205.9 9569.6 100
# collapse 14.8 24.45 38.432 36.90 43.9 121.6 100
# purrr 2553.6 2728.85 2847.378 2816.75 2940.8 3715.8 100
Upvotes: 1
Reputation: 163
I would also recommend gtsummary (written by Daniel D. Sjoberg et al). You can generate publication-ready or presentation-ready tables with the package. A gtsummary solution to the example given in the question would be:
library(tidyverse)
library(gtsummary)
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dt=data)
tbl_summary(df,
by=group,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}"), ) %>%
add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))
which then gives you the output below
Characteristic | A, N = 4 | B, N = 6 | C, N = 6 | D, N = 8 |
---|---|---|---|---|
dt | ||||
Mean (SD) | 61.0 (1.8) | 66.0 (2.8) | 68.0 (1.7) | 61.0 (2.6) |
Meian (IQR) | 61.0 (2.5) | 65.5 (2.5) | 68.0 (0.8) | 61.5 (3.2) |
Min- Max | 59.0 - 63.0 | 63.0 - 71.0 | 66.0 - 71.0 | 56.0 - 64.0 |
You can also export the table as word document by doing the following:
Table1 <- tbl_summary(df,
by=group,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({IQR})", "{min}- {max}"), ) %>%
add_stat_label(label = dt ~ c("Mean (SD)","Median (Inter Quant. Range)", "Min- Max"))
tmp1 <- "~path/name.docx"
Table1 %>%
as_flex_table() %>%
flextable::save_as_docx(path=tmp1)
You can use it for regression outputs as well. See the package reference manual and the package webpage for further insights
https://cran.r-project.org/web/packages/gtsummary/index.html https://www.danieldsjoberg.com/gtsummary/index.html
Upvotes: 2
Reputation: 11232
With more recent (>1.0) versions of dplyr
you can do so with
iris %>%
group_by(Species) %>%
summarise(as_tibble(rbind(summary(Sepal.Length))))
This works because dplyr will unpack the result of summarise
into columns if the argument evaluates into a dataframe.
Upvotes: 6
Reputation: 5269
Not sure why the popular skimr
package hasn’t been brought up. Their function skim()
was meant to replace the base R summary()
and supports dplyr
grouping:
library(dplyr)
library(skimr)
starwars %>%
group_by(gender) %>%
skim()
#> ── Data Summary ────────────────────────
#> Values
#> Name Piped data
#> Number of rows 87
#> Number of columns 14
#> _______________________
#> Column type frequency:
#> character 7
#> list 3
#> numeric 3
#> ________________________
#> Group variables gender
#>
#> ── Variable type: character ──────────────────────────────────────────────────────
#> skim_variable gender n_missing complete_rate min max empty n_unique
#> 1 name feminine 0 1 3 18 0 17
#> 2 name masculine 0 1 3 21 0 66
#> 3 name <NA> 0 1 8 14 0 4
#> 4 hair_color feminine 0 1 4 6 0 6
#> 5 hair_color masculine 5 0.924 4 13 0 9
#> 6 hair_color <NA> 0 1 4 7 0 4
#> # [...]
#>
#> ── Variable type: list ───────────────────────────────────────────────────────────
#> skim_variable gender n_missing complete_rate n_unique min_length max_length
#> 1 films feminine 0 1 9 1 5
#> 2 films masculine 0 1 24 1 7
#> 3 films <NA> 0 1 3 1 2
#> 4 vehicles feminine 0 1 3 0 1
#> 5 vehicles masculine 0 1 9 0 2
#> 6 vehicles <NA> 0 1 1 0 0
#> # [...]
#>
#> ── Variable type: numeric ────────────────────────────────────────────────────────
#> skim_variable gender n_missing complete_rate mean sd p0 p25 p50
#> 1 height feminine 1 0.941 165. 23.6 96 162. 166.
#> 2 height masculine 4 0.939 177. 37.6 66 171. 183
#> 3 height <NA> 1 0.75 181. 2.89 178 180. 183
#> # [...]
Upvotes: 7
Reputation: 19454
tapply
I'll put in my two cents for tapply()
.
tapply(df$dt, df$group, summary)
You could write a custom function with the specific statistics you want or format the results:
tapply(df$dt, df$group,
function(x) format(summary(x), scientific = TRUE))
$A
Min. 1st Qu. Median Mean 3rd Qu. Max.
"5.900e+01" "5.975e+01" "6.100e+01" "6.100e+01" "6.225e+01" "6.300e+01"
$B
Min. 1st Qu. Median Mean 3rd Qu. Max.
"6.300e+01" "6.425e+01" "6.550e+01" "6.600e+01" "6.675e+01" "7.100e+01"
$C
Min. 1st Qu. Median Mean 3rd Qu. Max.
"6.600e+01" "6.725e+01" "6.800e+01" "6.800e+01" "6.800e+01" "7.100e+01"
$D
Min. 1st Qu. Median Mean 3rd Qu. Max.
"5.600e+01" "5.975e+01" "6.150e+01" "6.100e+01" "6.300e+01" "6.400e+01"
data.table
The data.table
package offers a lot of helpful and fast tools for these types of operation:
library(data.table)
setDT(df)
> df[, as.list(summary(dt)), by = group]
group Min. 1st Qu. Median Mean 3rd Qu. Max.
1: A 59 59.75 61.0 61 62.25 63
2: B 63 64.25 65.5 66 66.75 71
3: C 66 67.25 68.0 68 68.00 71
4: D 56 59.75 61.5 61 63.00 64
Upvotes: 152
Reputation: 2863
this may also work,
spl <- split(mtcars, mtcars$cyl)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[,3:6], 2, summary)))
list.of.summaries
Upvotes: 1
Reputation: 107
The psych
package has a great option for grouped summary stats:
library(psych)
describeBy(dt, group="grp")
produces lots of useful stats including mean, median, range, sd, se.
Upvotes: 9
Reputation: 6416
dplyr package could be nice alternative to this problem:
library(dplyr)
df %>%
group_by(group) %>%
summarize(mean = mean(dt),
sum = sum(dt))
To get 1st quadrant and 3rd quadrant
df %>%
group_by(group) %>%
summarize(q1 = quantile(dt, 0.25),
q3 = quantile(dt, 0.75))
Upvotes: 70
Reputation: 2780
While some of the other approaches work, this is pretty close to what you were doing and only uses base r. If you know the aggregate command this may be more intuitive.
with( df , aggregate( dt , by=list(group) , FUN=summary) )
Upvotes: 8
Reputation: 1287
Using Hadley Wickham's purrr package this is quite simple. Use split
to split the passed data_frame
into groups, then use map
to apply the summary
function to each group.
library(purrr)
df %>% split(.$group) %>% map(summary)
Upvotes: 45
Reputation: 8413
after 5 long years I'm sure not much attention is going to be received for this answer, But still to make all options complete, here is the one with data.table
library(data.table)
setDT(df)[ , list(mean_gr = mean(dt), sum_gr = sum(dt)) , by = .(group)]
# group mean_gr sum_gr
#1: A 61 244
#2: B 66 396
#3: C 68 408
#4: D 61 488
Upvotes: 12
Reputation: 869
Besides describeBy
, the doBy
package is an another option. It provides much of the functionality of SAS PROC SUMMARY. Details:
http://www.statmethods.net/stats/descriptives.html
Upvotes: 6
Reputation:
There's many different ways to go about this, but I'm partial to describeBy
in the psych
package:
describeBy(df$dt, df$group, mat = TRUE)
Upvotes: 24
Reputation: 19
First, it depends on your version of R. If you've passed 2.11, you can use aggreggate with multiple results functions(summary, by instance, or your own function). If not, you can use the answer made by Justin.
Upvotes: 1
Reputation: 43265
take a look at the plyr
package. Specifically, ddply
ddply(df, .(group), summarise, mean=mean(dt), sum=sum(dt))
Upvotes: 14