gowerc

Reputation: 1109

How do ggplot stat_* functions work conceptually?

I'm currently trying to get my head around the differences between stat_* and geom_* in ggplot2. (Please note this is more of an interest/understanding-based question than a specific problem I am trying to solve.)

Introduction

My current understanding is that the stat_* functions apply a transformation to your data and that the result is then passed on to the geom_* to be displayed.

The simplest example is the identity transformation, which simply passes your data untransformed on to the geom.

ggplot(data = iris) + 
    stat_identity(aes(x = Sepal.Length, y = Sepal.Width) , geom= "point")

More practical use cases appear to be when you want to apply some transformation and supply the results to a non-default geom. For example, if you wanted to plot an error bar spanning the 1st and 3rd quartiles, you could do something like:

ggplot(data = iris) + 
    stat_boxplot(aes(x=Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")

Question 1

So how / when are these transformations applied to the dataset and how does data pass through them exactly?

As an example, say I wanted to take the stat_boxplot transformation and plot a point at the 3rd quartile. How would I do this?

My intuition would be something like:

ggplot(data = iris) + 
    stat_boxplot(aes(x=Species, y = ..upper..) , geom = "point")

or

ggplot(data = iris) + 
    stat_boxplot(aes(x=Species, y = Sepal.Length) , geom = "point")

however both error with

Error: geom_point requires the following missing aesthetics: y

My guess is that, as part of the stat_boxplot transformation, it consumes the y aesthetic and produces a dataset that does not contain any y variable. However, this leads on to...

Question 2

Where can I find out which variables are consumed as part of the stat_* transformation and which variables they output? Maybe I'm looking in the wrong places, but the documentation does not seem clear to me at all...

Upvotes: 3

Views: 474

Answers (1)

Pierre Gramme

Reputation: 1254

Interesting questions...

As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is an even better source, but I don't have that one.

The main steps for building a graph with one layer and no facet are:

  1. Apply aesthetics mapping on input data (in simple cases, this is a selection and renaming on columns)
  2. Apply scale transformation (if any) on each data column
  3. Compute stat on each data group (i.e. per Species in this case)
  4. Apply aesthetics mapping on stat data, detected with ..<name>.. or stat(name)
  5. Apply position adjustment
  6. Build graphical objects
  7. Apply coordinate transformations

As you guessed, the behaviour at step 3 is similar to dplyr::transmute(): it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y column isn't passed to the geom.
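One way to check this yourself is layer_data(), which returns the data frame a layer ends up with after the build. A minimal sketch (the names p and built are only illustrative, and the exact set of extra columns may vary between ggplot2 versions):

```r
library(ggplot2)

# A plain boxplot layer: x and y are mapped before the stat runs
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_boxplot()

# Data of layer 1 after the build: one row per group instead of 150 rows
built <- layer_data(p, 1)

nrow(built)                                              # one row per Species
intersect(c("lower", "middle", "upper"), names(built))   # computed stats are present
"y" %in% names(built)                                    # check whether the input y column survived
```

Note that layer_data() shows the data after the geom has also added its own columns, so it is a superset of what the stat itself produced.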

To plot the third quartile as a point, we'd like to specify different mappings at step 1 (before the stat) and at step 4 (before the geom). I thought something like this would work:

# This does not work:
ggplot(data = iris) + 
  geom_point(
    aes(x=Species, y=stat(upper)), 
    stat=stat_boxplot(aes(x=Species, y=Sepal.Length)) )

... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).

NB: stat(upper) is an equivalent, more recent, notation to your ..upper..
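For reference, ggplot2 3.3.0 and later spell the same idea after_stat(upper). As a sketch, assuming such a version is installed, the error-bar example from the question could be written like this (p_err is just an illustrative name):

```r
library(ggplot2)

# x and y feed the stat; ymin and ymax are taken from the stat's output
p_err <- ggplot(iris) +
  stat_boxplot(
    aes(
      x = Species, y = Sepal.Length,
      ymin = after_stat(lower),   # 1st quartile computed by the stat
      ymax = after_stat(upper)    # 3rd quartile computed by the stat
    ),
    geom = "errorbar"
  )
```

Printing p_err draws the plot. This works because geom_errorbar only needs x, ymin and ymax, all of which are available after the stat.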

I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():

library(tidyverse)
iris %>%
  group_by(Species) %>%
  select(y = Sepal.Length) %>%            # the grouping column Species is kept automatically
  do(StatBoxplot$compute_group(.)) %>%    # run the boxplot stat once per group
  ggplot(aes(Species, upper)) + geom_point()

A bit less elegant, I admit...
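That said, newer ggplot2 releases (3.3.0 and later) provide stage(), which appears to allow exactly the two-step mapping described above: one value for y going into the stat and another one coming out of it. A sketch under that version assumption (p_q3 is an illustrative name):

```r
library(ggplot2)

# stage() maps y to Sepal.Length before the stat runs,
# then remaps y to the computed `upper` column afterwards
p_q3 <- ggplot(iris, aes(x = Species)) +
  stat_boxplot(
    aes(y = stage(start = Sepal.Length, after_stat = upper)),
    geom = "point"
  )
```

Printing p_q3 should draw one point per species at the third quartile, without computing the stat by hand.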

For your question 2, it's in the documentation: see the Aesthetics and Computed variables sections of ?stat_boxplot
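If you prefer to inspect this from the console, the Stat object itself carries some of the same information. A small sketch (the output depends on your ggplot2 version, so none is shown here):

```r
library(ggplot2)

StatBoxplot$required_aes   # aesthetics the stat insists on
StatBoxplot$aesthetics()   # all aesthetics the stat understands

# Computed variables: build a layer and list the resulting columns
# (this also includes columns added later by the geom)
names(layer_data(ggplot(iris, aes(Species, Sepal.Length)) + stat_boxplot()))
```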

Upvotes: 3
