Gavin Simpson
Gavin Simpson

Reputation: 174778

How to combine stat_ecdf with geom_ribbon?

I am trying to draw an ECDF of some data with a "confidence interval" represented via a shaded region using ggplot2. I am having trouble combining geom_ribbon() with stat_ecdf() to achieve the effect I am after.

Consider the following example data:

set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)

> head(dat)
  variable     lower    upper
1 2.534484 0.5344838 4.534484
2 3.201587 1.2015872 5.201587
3 2.433602 0.4336018 4.433602
4 6.929713 4.9297132 8.929713
5 3.390284 1.3902836 5.390284
6 2.440225 0.4402254 4.440225

I am able to produce an ECDF of variable using

library("ggplot2")
ggplot(dat, aes(x = variable)) +
    geom_step(stat = "ecdf")

However I am unable to use lower and upper as the ymin and ymax aesthetics of geom_ribbon() to superimpose the confidence interval on the plot as another layer. I have tried:

ggplot(dat, aes(x = variable)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), stat = "ecdf") +
    geom_step(stat = "ecdf")

but this raises the following error

Error: geom_ribbon requires the following missing aesthetics: ymin, ymax

Is there a way to coax geom_ribbon() into working with stat_ecdf() to produce a shaded confidence interval? Or, can anyone suggest an alternative means of adding a shaded polygon defined by lower and upper as a layer to the ECDF plot?

Upvotes: 7

Views: 2395

Answers (3)

wacekk
wacekk

Reputation: 13

I was looking for a solution to a similar problem and came across this answer. I slightly modified the procedure to make it more universal - it also worked on charts with different aesthetics and facets.

First, I load the necessary packages and generate the dummy data.

library(reprex)
library(tidyverse)

set.seed(118)
data1 <- tibble(a = rnorm(100, mean = 20, sd = 3),
                ID = "a",
                gr = "A") %>% 
  bind_rows(tibble(a = rnorm(100, mean = 30, sd = 3),
                   ID = "z",
                   gr = "A")) %>% 
  bind_rows(tibble(a = rnorm(100, mean = 25, sd = 3),
                   ID = "a",
                   gr = "B")) %>% 
  bind_rows(tibble(a = rnorm(100, mean = 40, sd = 3),
                   ID = "z",
                   gr = "B")) %>% 
  mutate(CI_low = a - 3,
         CI_high = a + 3)

Then I generate a chart:

data1 %>% 
  ggplot(aes(x = a, color = gr))+
  geom_step(stat = "ecdf")+
  facet_wrap(~ ID)

After generating the chart, you can preview its data, along with the calculated ecdf:


ggplot_build(last_plot())$data[[1]] %>% 
  as_tibble()
#> # A tibble: 408 × 10
#>    colour      y      x  ecdf flipped_aes PANEL group linewidth linetype alpha
#>    <chr>   <dbl>  <dbl> <dbl> <lgl>       <fct> <int>     <dbl>    <dbl> <lgl>
#>  1 #F8766D  0    -Inf    0    FALSE       1         1       0.5        1 NA   
#>  2 #F8766D  0.09   15.0  0.09 FALSE       1         1       0.5        1 NA   
#>  3 #F8766D  0.55   20.5  0.55 FALSE       1         1       0.5        1 NA   
#>  4 #F8766D  0.51   20.0  0.51 FALSE       1         1       0.5        1 NA   
#>  5 #F8766D  0.6    20.9  0.6  FALSE       1         1       0.5        1 NA   
#>  6 #F8766D  0.08   14.9  0.08 FALSE       1         1       0.5        1 NA   
#>  7 #F8766D  0.19   16.9  0.19 FALSE       1         1       0.5        1 NA   
#>  8 #F8766D  0.47   19.7  0.47 FALSE       1         1       0.5        1 NA   
#>  9 #F8766D  0.73   22.0  0.73 FALSE       1         1       0.5        1 NA   
#> 10 #F8766D  0.34   18.5  0.34 FALSE       1         1       0.5        1 NA   
#> # ℹ 398 more rows

The next step is to assign the values ​​of the grouping variables to individual aesthetics (in my case, color) and groups of variables (facets' names). This must be done manually, paying special attention to ensure that the assignment is correct.

graph_data <- ggplot_build(last_plot())$data[[1]] %>% 
  as_tibble() %>% 
  mutate(gr = case_match(group,
                         1 ~ "A",
                         2 ~ "B")) %>% 
  mutate(ID = case_match(PANEL,
                         "1" ~ "a",
                         "2" ~ "z")) %>% 
  select(a = x, ecdf, gr, ID)

In the next step, I combine the data with the original data and generate a new chart:

data1 %>% 
  left_join(graph_data) %>% 
  ggplot(aes(x = a, color = gr))+
  geom_ribbon(aes(xmin = CI_low, xmax = CI_high, y = ecdf, fill = gr, color = NULL), alpha = 0.1)+
  geom_step(stat = "ecdf")+
  facet_wrap(~ ID)
#> Joining with `by = join_by(a, ID, gr)`

Created on 2024-07-25 with reprex v2.1.1

Upvotes: 0

Troy
Troy

Reputation: 8691

Not sure exactly how you want to reflect the CI, but ggplot_build() lets you get the generated data back from the plot, you can then overplot what you like.

This chart shows:

  • red = original ribbon
  • blue = takes the original CI vectors and applies to the ecdf curve
  • green = calculates the ecdf of upper and lower series and plots

enter image description here

    g<-ggplot(dat, aes(x = variable)) +
      geom_step(stat = "ecdf") +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.5, fill="red") 

    inside<-ggplot_build(g)
    matched<-merge(inside$data[[1]],data.frame(x=dat$variable,dat$lower,dat$upper),by=("x"))

    g + 
      geom_ribbon(data=matched, aes(x = x, 
                                      ymin = y + dat.upper-x,
                                      ymax = y - x + dat.lower), 
                    alpha=0.5, fill="blue") +
      geom_ribbon(data=matched, aes(x = x, 
                                      ymin = ecdf(dat.lower)(x),
                                      ymax = ecdf(dat.upper)(x)), 
                    alpha=0.5, fill="green")

Upvotes: 2

joran
joran

Reputation: 173517

Try this (a bit of shot in the dark):

ggplot(dat, aes(x = variable)) +
  geom_ribbon(aes(x = variable,ymin = ..y..-2,ymax = ..y..+2), stat = "ecdf",alpha=0.2) +
  geom_step(stat = "ecdf")

Ok, so that's not the same thing as what you trying to do, but it should explain what's going on. The stat is returning a data frame with just the original x and the computed y, so I think that's all you have to work with. i.e. stat_ecdf only computes the cumulative distribution function for a single x at a time.

The only other thing I can think of is the obvious, calculating the lower and upper separately, something like this:

l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)
dat$lower1 <- l(dat$variable)
dat$upper1 <- u(dat$variable)
dat$variable1 <- v(dat$variable)

ggplot(dat,aes(x = variable)) + 
  geom_step(aes(y = variable1)) + 
  geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)

Upvotes: 3

Related Questions