Reputation: 1109
I'm currently trying to get my head around the differences between stat_*
and geom_*
in ggplot2
. (Please note this is more of an interest/understanding based question than a specific problem I am trying solve).
My current understanding is that is that the stat_*
functions apply a transformation to your data and that the result is then passed onto the geom_*
to be displayed.
Most simple example being the identity transformation which simply passes your data untransformed onto the geom.
ggplot(data = iris) +
stat_identity(aes(x = Sepal.Length, y = Sepal.Width) , geom= "point")
More practical use-cases appear to be when you want to use some transformation and supply the results to a non-default geom, for example if you wanted to plot an error bar of the 1st and 3rd quartile you could do something like:
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")
So how / when are these transformations applied to the dataset and how does data pass through them exactly?
As an example, say I wanted to take the stat_boxplot
transformation and plot the point of the 3rd quartile how would I do this ?
My intuition would be something like :
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = ..upper..) , geom = "point")
or
ggplot(data = iris) +
stat_boxplot(aes(x=Species, y = Sepal.Length) , geom = "point")
however both error with
Error: geom_point requires the following missing aesthetics: y
My guess is as part of the stat_boxplot
transformation it consumes the y aesthetic and produces a dataset not containing any y
variable however this leads onto ....
Where can I find out which variables are consumed as part of the stat_*
transformation and what variables they output? Maybe i'm looking in the wrong places but the documentation does not seem clear to me at all...
Upvotes: 3
Views: 474
Reputation: 1254
Interesting questions...
As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is even a better source, but I don't have that one.
The main steps for building a graph with one layer and no facet are:
..<name>..
or stat(name)
As you guessed, the behaviour at step 3 is similar to dplyr::transmute()
: it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y
column isn't passed to the geom.
To do this, we'd like to specify different mappings at step 1 (before stat) and at step 4 (before geom). I thought something like this would work:
# This does not work:
ggplot(data = iris) +
geom_point(
aes(x=Species, y=stat(upper)),
stat=stat_boxplot(aes(x=Species, y=Sepal.Length)) )
... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).
NB: stat(upper)
is an equivalent, more recent, notation to your ..upper..
I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():
library(tidyverse)
iris %>%
group_by(Species) %>%
select(y=Sepal.Length) %>%
do(StatBoxplot$compute_group(.)) %>%
ggplot(aes(Species, upper)) + geom_point()
A bit less elegant, I admit...
For your question 2, it's in the doc: see sections Aesthetics and Computed variables of ?stat_boxplot
Upvotes: 3