Reputation: 7251
How do I group by columns, then compute the mean and standard deviation of every other column in R?
As an example, consider the famous Iris data set. I want to group by species, then compute the mean and sd of the petal and sepal length and width measurements. I know that this has something to do with split-apply-combine, but I am not sure how to proceed from there.
What I can come up with:
require(plyr)
x <- ddply(iris, .(Species), summarise,
           Sepal.Length.Mean = mean(Sepal.Length),
           Sepal.Length.Sd = sd(Sepal.Length),
           Sepal.Width.Mean = mean(Sepal.Width),
           Sepal.Width.Sd = sd(Sepal.Width),
           Petal.Length.Mean = mean(Petal.Length),
           Petal.Length.Sd = sd(Petal.Length),
           Petal.Width.Mean = mean(Petal.Width),
           Petal.Width.Sd = sd(Petal.Width))
Species Sepal.Length.Mean Sepal.Length.Sd Sepal.Width.Mean Sepal.Width.Sd
1 setosa 5.006 0.3524897 3.428 0.3790644
2 versicolor 5.936 0.5161711 2.770 0.3137983
3 virginica 6.588 0.6358796 2.974 0.3224966
Petal.Length.Mean Petal.Length.Sd Petal.Width.Mean Petal.Width.Sd
1 1.462 0.1736640 0.246 0.1053856
2 4.260 0.4699110 1.326 0.1977527
3 5.552 0.5518947 2.026 0.2746501
Desired output:
z <- data.frame(setosa = c(5.006, 0.3524897, 3.428, 0.3790644,
                           1.462, 0.1736640, 0.246, 0.1053856),
                versicolor = c(5.936, 0.5161711, 2.770, 0.3137983,
                               4.260, 0.4699110, 1.326, 0.1977527),
                virginica = c(6.588, 0.6358796, 2.974, 0.3224966,
                              5.552, 0.5518947, 2.026, 0.2746501))
rownames(z) <- c('Sepal.Length.Mean', 'Sepal.Length.Sd',
                 'Sepal.Width.Mean', 'Sepal.Width.Sd',
                 'Petal.Length.Mean', 'Petal.Length.Sd',
                 'Petal.Width.Mean', 'Petal.Width.Sd')
setosa versicolor virginica
Sepal.Length.Mean 5.0060000 5.9360000 6.5880000
Sepal.Length.Sd 0.3524897 0.5161711 0.6358796
Sepal.Width.Mean 3.4280000 2.7700000 2.9740000
Sepal.Width.Sd 0.3790644 0.3137983 0.3224966
Petal.Length.Mean 1.4620000 4.2600000 5.5520000
Petal.Length.Sd 0.1736640 0.4699110 0.5518947
Petal.Width.Mean 0.2460000 1.3260000 2.0260000
Petal.Width.Sd 0.1053856 0.1977527 0.2746501
Upvotes: 3
Views: 1578
Reputation: 7251
Inspired by the answers, I figured out a solution that also works, using only dplyr and tidyr functions.
require(tidyr)
require(dplyr)
# Convert the iris data from wide form to long form
x <- iris %>%
  gather(var, value, -Species)
print(tbl_df(x))
# Compute the mean and sd for each dimension
x <- x %>%
  group_by(Species, var) %>%
  summarise(mean = mean(value), sd = sd(value)) %>%
  ungroup()
print(tbl_df(x))
# Convert the summary columns "mean" and "sd" from wide form to long form
x <- x %>%
  gather(stat, value, mean:sd)
print(tbl_df(x))
# Combine the variables "var" and "stat" into a single variable
x <- x %>%
  unite(var, var, stat, sep = '.')
print(tbl_df(x))
# Convert the data frame from long form to wide form
x <- x %>%
  spread(Species, value)
print(tbl_df(x))
Upvotes: 1
Reputation: 8760
If you want to use data.table for performance reasons, you could try this (don't be afraid, there are more comments than code ;-). I have tried to optimize all performance-critical spots.
library(data.table)
dt <- as.data.table(iris)
# Helper function similar to "colwise" of package "plyr":
# Apply a function "func" to each column of the data.table "data"
# and append the "suffix" string to the result column name.
colwise.dt <- function(data, func, suffix) {
  # Apply the function to each column of the data table
  result <- lapply(data, func)
  # Convert the result list into a data table efficiently ("by ref")
  setDT(result)
  # Append the suffix to each column name efficiently ("by ref")
  setnames(result, names(result), paste0(names(result), suffix))
}
wide.result <- dt[, c(colwise.dt(.SD, mean, ".mean"), colwise.dt(.SD, sd, ".sd")), by=.(Species)]
# Note: .SD is a data.table containing the subset of dt's data for each group (Species), excluding any columns used in "by" (here: Species column)
# Now reshape the result into long form (one row per Species/statistic pair)
long.result <- melt(wide.result, id.vars="Species")
# Now transform into one column per group
final.result <- dcast(long.result, variable ~ Species)
wide.result is:
Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
1: setosa 5.006 3.428 1.462 0.246 0.3524897 0.3790644 0.1736640 0.1053856
2: versicolor 5.936 2.770 4.260 1.326 0.5161711 0.3137983 0.4699110 0.1977527
3: virginica 6.588 2.974 5.552 2.026 0.6358796 0.3224966 0.5518947 0.2746501
long.result is:
Species variable value
1: setosa Sepal.Length.mean 5.0060000
2: versicolor Sepal.Length.mean 5.9360000
3: virginica Sepal.Length.mean 6.5880000
4: setosa Sepal.Width.mean 3.4280000
5: versicolor Sepal.Width.mean 2.7700000
6: virginica Sepal.Width.mean 2.9740000
7: setosa Petal.Length.mean 1.4620000
8: versicolor Petal.Length.mean 4.2600000
9: virginica Petal.Length.mean 5.5520000
10: setosa Petal.Width.mean 0.2460000
11: versicolor Petal.Width.mean 1.3260000
12: virginica Petal.Width.mean 2.0260000
13: setosa Sepal.Length.sd 0.3524897
14: versicolor Sepal.Length.sd 0.5161711
15: virginica Sepal.Length.sd 0.6358796
16: setosa Sepal.Width.sd 0.3790644
17: versicolor Sepal.Width.sd 0.3137983
18: virginica Sepal.Width.sd 0.3224966
19: setosa Petal.Length.sd 0.1736640
20: versicolor Petal.Length.sd 0.4699110
21: virginica Petal.Length.sd 0.5518947
22: setosa Petal.Width.sd 0.1053856
23: versicolor Petal.Width.sd 0.1977527
24: virginica Petal.Width.sd 0.2746501
final.result is:
variable setosa versicolor virginica
1: Sepal.Length.mean 5.0060000 5.9360000 6.5880000
2: Sepal.Width.mean 3.4280000 2.7700000 2.9740000
3: Petal.Length.mean 1.4620000 4.2600000 5.5520000
4: Petal.Width.mean 0.2460000 1.3260000 2.0260000
5: Sepal.Length.sd 0.3524897 0.5161711 0.6358796
6: Sepal.Width.sd 0.3790644 0.3137983 0.3224966
7: Petal.Length.sd 0.1736640 0.4699110 0.5518947
8: Petal.Width.sd 0.1053856 0.1977527 0.2746501
The only difference from your desired output is that the final result contains the value names in a first column named variable instead of storing them in the row names. You could get there by setting the row names from the first column and then removing that column.
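The row-name step described above could look roughly like the following. This is only a sketch: the small final.result built here is a two-row stand-in (values copied from the output shown earlier), and df is a name chosen for illustration. Row names are set on a plain data.frame copy rather than on the data.table itself.

```r
library(data.table)

# Stand-in for the first two rows of final.result (values taken from the output above).
final.result <- data.table(
  variable   = c("Sepal.Length.mean", "Sepal.Width.mean"),
  setosa     = c(5.006, 3.428),
  versicolor = c(5.936, 2.770),
  virginica  = c(6.588, 2.974)
)

# Work on a plain data.frame copy so that row names can be assigned.
df <- as.data.frame(final.result)

# Use the first column as row names, then drop it.
rownames(df) <- as.character(df$variable)
df$variable <- NULL

df
```

After this, df has one column per species and the statistic names as row names, which matches the layout asked for in the question.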
Upvotes: 1
Reputation: 121127
Here is the traditional plyr approach. It uses colwise to compute summary statistics on all columns.
library(plyr)
means <- ddply(iris, .(Species), colwise(mean))
sds <- ddply(iris, .(Species), colwise(sd))
merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))
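The merged result has one row per species, so it still needs to be transposed to match the layout in the question. A minimal sketch of that last step, assuming plyr is installed (wide and out are names introduced here for illustration):

```r
library(plyr)

# Per-species means and standard deviations of every measurement column.
means <- ddply(iris, .(Species), colwise(mean))
sds   <- ddply(iris, .(Species), colwise(sd))

# One row per species; columns get the ".mean" / ".sd" suffixes.
wide <- merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))

# Drop the Species column, transpose the numeric part,
# and label the resulting columns by species.
out <- t(wide[-1])
colnames(out) <- as.character(wide$Species)

out
```

Here out is a numeric matrix with the statistics as rows and the species as columns; wrap it in as.data.frame() if a data frame is preferred.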
Upvotes: 3
Reputation: 887511
We can try with dplyr
library(dplyr)
res <- iris %>%
  group_by(Species) %>%
  summarise_each(funs(mean, sd))
`colnames<-`(t(res[-1]), as.character(res$Species))
# setosa versicolor virginica
#Sepal.Length_mean 5.0060000 5.9360000 6.5880000
#Sepal.Width_mean 3.4280000 2.7700000 2.9740000
#Petal.Length_mean 1.4620000 4.2600000 5.5520000
#Petal.Width_mean 0.2460000 1.3260000 2.0260000
#Sepal.Length_sd 0.3524897 0.5161711 0.6358796
#Sepal.Width_sd 0.3790644 0.3137983 0.3224966
#Petal.Length_sd 0.1736640 0.4699110 0.5518947
#Petal.Width_sd 0.1053856 0.1977527 0.2746501
Or, as @Steven Beaupre mentioned in the comments, the same output can be obtained by reshaping with gather and spread:
library(tidyr)
iris %>%
  group_by(Species) %>%
  summarise_each(funs(mean, sd)) %>%
  gather(key, value, -Species) %>%
  spread(Species, value)
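In more recent releases, summarise_each() and funs() are deprecated and gather()/spread() are superseded. A rough equivalent of the pipeline above, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0, might look like this (the column name stat and the Mean/Sd labels are choices made here to match the row names in the question):

```r
library(dplyr)
library(tidyr)

result <- iris %>%
  group_by(Species) %>%
  # Apply mean and sd to every non-grouping column;
  # .names builds names such as "Sepal.Length.Mean".
  summarise(across(everything(),
                   list(Mean = mean, Sd = sd),
                   .names = "{.col}.{.fn}")) %>%
  # Long form: one row per Species/statistic pair.
  pivot_longer(-Species, names_to = "stat", values_to = "value") %>%
  # Wide form: one column per species.
  pivot_wider(names_from = Species, values_from = value)

result
```

The result is a tibble with the statistic names in the stat column rather than in row names, since tibbles do not keep row names.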
Upvotes: 9