Reputation: 458
I have a list of data frames, organized by year. I am using lapply to get the summary of a single variable in each data frame. The output follows the list and gives a summary for each year, one by one. However, I want the output as a single table with years as rows. How do I do this? An example using the iris dataset shows my problem:
x <- split(iris$Sepal.Length, iris$Species)
lapply(x, summary)
And the output is:
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   4.800   5.000   5.006   5.200   5.800
Similarly for the other two.
I want the output organized as a single table, as with:
> sapply(x, summary)
        setosa versicolor virginica
Min.     4.300      4.900     4.900
1st Qu.  4.800      5.600     6.225
Median   5.000      5.900     6.500
Mean     5.006      5.936     6.588
3rd Qu.  5.200      6.300     6.900
Max.     5.800      7.000     7.900
But with setosa, versicolor, virginica (or years in my case) on the left and Min. ... Max. across the top. I can flip the axes around in ggplot, but reading the table itself is more intuitive with the years on the left. I came across a number of discussions about converting lapply output, but they all computed a single statistic such as the mean or median. Thanks.
Upvotes: 0
Views: 1656
Reputation: 42564
If you have a large data.frame, I recommend not splitting it into pieces but using data.table to group by year. With the iris data set this could be done as follows:
library(data.table)
setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
#       Species Min. 1st Qu. Median  Mean 3rd Qu. Max.
#1:      setosa  4.3   4.800    5.0 5.006     5.2  5.8
#2:  versicolor  4.9   5.600    5.9 5.936     6.3  7.0
#3:   virginica  4.9   6.225    6.5 6.588     6.9  7.9
as.list() ensures the output of summary() appears column-wise as requested.
The result is a data.table (not a matrix) which can be used directly in a subsequent ggplot() call.
Note that copy(iris) is only required here because the iris data set is locked to prevent modifying its variable bindings. With your own data.frame df you would simply use setDT(df) to coerce it to a data.table without copying.
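Since the question starts from a list of data frames rather than one large data.frame, here is a minimal sketch of how the list could be stacked first and then summarised by group. The list `dfs`, its year names, and the column `value` are made up for illustration; `rbindlist()` with `idcol` is one way to carry the list names along as a column.

```r
library(data.table)

# Hypothetical data: a named list of data frames, one per year.
set.seed(1)
dfs <- list(
  "2001" = data.frame(value = rnorm(20)),
  "2002" = data.frame(value = rnorm(20)),
  "2003" = data.frame(value = rnorm(20))
)

# Stack the list into one data.table; the list names become the `year` column.
DT <- rbindlist(dfs, idcol = "year")

# One row per year, one column per summary statistic.
res <- DT[, as.list(summary(value)), by = year]
res
```

If the data already live in a single data.frame with a year column, the rbindlist() step can be skipped and setDT(df) used instead.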
The OP mentioned using the result for plotting with ggplot2. Now, ggplot2 works best when data are provided in long format. Reshaping a data.table from wide to long format can be conveniently done with melt():
wideDT <- setDT(copy(iris))[, as.list(summary(Sepal.Length)), by = Species]
longDT <- melt(wideDT, id.vars = "Species")
longDT
# Species variable value
# 1: setosa Min. 4.300
# 2: versicolor Min. 4.900
# 3: virginica Min. 4.900
# 4: setosa 1st Qu. 4.800
# 5: versicolor 1st Qu. 5.600
# 6: virginica 1st Qu. 6.225
# 7: setosa Median 5.000
# 8: versicolor Median 5.900
# 9: virginica Median 6.500
#10: setosa Mean 5.006
#11: versicolor Mean 5.936
#12: virginica Mean 6.588
#13: setosa 3rd Qu. 5.200
#14: versicolor 3rd Qu. 6.300
#15: virginica 3rd Qu. 6.900
#16: setosa Max. 5.800
#17: versicolor Max. 7.000
#18: virginica Max. 7.900
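As a rough illustration of why the long format is convenient, the sketch below plots each statistic per species. The choice of geoms and the column name `statistic` are assumptions for the example, not something the OP specified.

```r
library(data.table)
library(ggplot2)

# Wide summary table: one row per species, one column per statistic.
wideDT <- as.data.table(iris)[, as.list(summary(Sepal.Length)), by = Species]

# Long format: one row per (species, statistic) pair.
longDT <- melt(wideDT, id.vars = "Species", variable.name = "statistic")

# Each statistic on the x axis, one line per species.
p <- ggplot(longDT, aes(x = statistic, y = value,
                        colour = Species, group = Species)) +
  geom_point() +
  geom_line()
print(p)
```

as.data.table(iris) makes a copy, so it stands in for setDT(copy(iris)) here.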
Upvotes: 1
Reputation: 99361
This seems like a good time to use by(). It eliminates the need for the call to split(), is all done in one line, and returns a matrix.
with(iris, do.call(rbind, by(Sepal.Length, Species, summary)))
#            Min. 1st Qu. Median  Mean 3rd Qu. Max.
# setosa      4.3   4.800    5.0 5.006     5.2  5.8
# versicolor  4.9   5.600    5.9 5.936     6.3  7.0
# virginica   4.9   6.225    6.5 6.588     6.9  7.9
If you still wish to use the manual split-apply-combine method, then it would be
do.call(rbind, lapply(x, summary))
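Applied to the situation in the question (a list of data frames, one per year), the same idea might look like the sketch below. The list `dfs`, its year names, and the column `value` are hypothetical stand-ins for the real data. Transposing the sapply() result with t() should give the same table.

```r
# Hypothetical data: a named list of data frames, one per year.
set.seed(1)
dfs <- list(
  "2001" = data.frame(value = rnorm(20)),
  "2002" = data.frame(value = rnorm(20)),
  "2003" = data.frame(value = rnorm(20))
)

# Summarise one column per data frame, then stack the summaries as rows.
by_year <- do.call(rbind, lapply(dfs, function(d) summary(d$value)))

# Alternative: sapply() builds a statistics-by-year matrix, t() flips it.
by_year_t <- t(sapply(dfs, function(d) summary(d$value)))

by_year
```

Both results are plain numeric matrices with the years as row names, so as.data.frame() can be applied if a data frame is needed later.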
Upvotes: 1