Reputation: 6213
I am trying to calculate multiple stats for a dataframe.
I tried dplyr
's summarise_each
. However, the results are returned in a flat, single-row with the function's name added as a suffix.
Is there a direct way - using dplyr
or base r - where I can get the results in a data frame, with the columns as the data frame's columns and the rows as the summary functions?
library(dplyr)
df = data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
df %>% summarise_each(funs(min, max))
# A_min B_min C_min A_max B_max C_max
# 1 13 117 0 98 188 1
# Desired format
summary(df)
# A B C
# Min. :13.00 Min. :117.0 Min. :0.00
# 1st Qu.:34.75 1st Qu.:134.2 1st Qu.:0.00
# Median :45.00 Median :148.0 Median :1.00
# Mean :52.35 Mean :149.9 Mean :0.65
# 3rd Qu.:62.50 3rd Qu.:168.8 3rd Qu.:1.00
# Max. :98.00 Max. :188.0 Max. :1.00
Upvotes: 6
Views: 11680
Reputation: 83215
Why not just simply use sapply
with summary
?
sapply(df, summary)
gives:
A B C Min. 1.00 112.0 0.00 1st Qu. 23.75 134.5 0.00 Median 57.00 148.5 1.00 Mean 50.15 149.9 0.55 3rd Qu. 77.50 167.2 1.00 Max. 94.00 191.0 1.00
To get a dataframe back, just wrap the sapply
call in data.frame()
: data.frame(sapply(df, summary))
. If you want to keep summary statisticnames in a column, you can extract them with rownames(df)
with df$rn <- rownames(df)
or use the keep.rownames
-parameter from data.table
:
library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)
which gives:
> dt rn A B C 1: Min. 11.00 113.0 0.0 2: 1st Qu. 21.50 126.8 0.0 3: Median 55.00 138.0 0.5 4: Mean 53.65 145.2 0.5 5: 3rd Qu. 83.25 160.5 1.0 6: Max. 98.00 193.0 1.0
Upvotes: 10
Reputation: 8377
With the data.frame you proposed, and using the library purrr
library(purrr)
out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
#### A B C
#### Min. 7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median 53.50 143.5 0.5
#### Mean 55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max. 99.00 199.0 1.0
There you go. let me just mention that this code works for input data.frame with 100% numeric variables only. It will return an error if there is for instance, a character/factor variable, since the output of summary is quite different.
Upvotes: 4
Reputation: 6213
Thank you all very much for your help! After some cherry-picking, I used the following method.
# Dataframe
df = data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
# Add summary functions to a list
summaryFns = list(
NA.n = function(x) sum(is.na(x)),
NA.percent = function(x) sum(is.na(x))/length(x),
unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
min = function(x) min(x, na.rm=TRUE),
max = function(x) max(x, na.rm=TRUE))
# Summarise data frame with each function
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
# NA.n NA.percent unique.n min max
# A 0 0 20 1 98
# B 0 0 20 114 200
# C 0 0 2 0 1
# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
# NA.n NA.percent unique.n min max
# A 0 0 20 1 98
# B 0 0 20 114 200
# C 0 0 2 0 1
I think that is the most straight forward and flexible method.
Further comments, modifications and suggestions are appreciated.
Upvotes: 0
Reputation: 2621
An approach without using tidyr
nor dplyr
:
df <- data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
df %>%
lapply(summary) %>%
do.call("rbind", .)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
A 9 32.5 50.5 49.65 70.25 84
B 116 137.2 162.5 157.70 178.20 196
C 0 0.0 0.0 0.45 1.00 1
If you want to do it with dplyr
, try:
df %>%
gather(attribute, value) %>%
group_by(attribute) %>%
do(as.data.frame(as.list(summary(.$value))))
Output:
Source: local data frame [3 x 7]
Groups: attribute [3]
attribute Min. X1st.Qu. Median Mean X3rd.Qu. Max.
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 9 32.5 50.5 49.65 70.25 84
2 B 116 137.2 162.5 157.70 178.20 196
3 C 0 0.0 0.0 0.45 1.00 1
Upvotes: 1
Reputation: 35187
How about:
library(tidyr)
gather(df) %>% group_by(key) %>% summarise_all(funs(min, max))
# A tibble: 3 × 3 key min max <chr> <dbl> <dbl> 1 A 2 92 2 B 111 194 3 C 0 1
Upvotes: 10
Reputation: 6542
It is not the only way to go but you could reshape your data.frame as you want using dplyr
and tidyr
. (and stringr
or other to trim character.)
library(dplyr)
df = data.frame(A = sample(1:100, 20),
B = sample(110:200, 20),
C = sample(c(0,1), 20, replace = T))
as_data_frame(summary(df)) %>%
# some blank character could be trim
mutate(Var2 = stringr::str_trim(Var2)) %>%
# you don't need Var1
select(-Var1) %>%
# Get the type of summary and the value
tidyr::separate(n, c("Type", "value"), sep = ":") %>%
# Convert value to numeric
mutate(value = as.numeric(value)) %>%
# reshape as you wish
tidyr::spread(Var2, value, drop = T)
#> # A tibble: 6 x 4
#> Type A B C
#> * <chr> <dbl> <dbl> <dbl>
#> 1 1st Qu. 36.25 122.2 1.00
#> 2 3rd Qu. 77.25 164.5 1.00
#> 3 Max. 95.00 193.0 1.00
#> 4 Mean 57.30 144.6 0.85
#> 5 Median 63.00 143.5 1.00
#> 6 Min. 8.00 111.0 0.00
Upvotes: 2