Reputation: 174778
I am trying to draw an ECDF of some data with a "confidence interval" represented via a shaded region using ggplot2. I am having trouble combining geom_ribbon()
with stat_ecdf()
to achieve the effect I am after.
Consider the following example data:
set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)
> head(dat)
variable lower upper
1 2.534484 0.5344838 4.534484
2 3.201587 1.2015872 5.201587
3 2.433602 0.4336018 4.433602
4 6.929713 4.9297132 8.929713
5 3.390284 1.3902836 5.390284
6 2.440225 0.4402254 4.440225
I am able to produce an ECDF of variable
using
library("ggplot2")
ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf")
However I am unable to use lower
and upper
as the ymin
and ymax
aesthetics of geom_ribbon()
to superimpose the confidence interval on the plot as another layer. I have tried:
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(ymin = lower, ymax = upper), stat = "ecdf") +
geom_step(stat = "ecdf")
but this raises the following error
Error: geom_ribbon requires the following missing aesthetics: ymin, ymax
Is there a way to coax geom_ribbon()
into working with stat_ecdf()
to produce a shaded confidence interval? Or, can anyone suggest an alternative means of adding a shaded polygon defined by lower
and upper
as a layer to the ECDF plot?
Upvotes: 7
Views: 2395
Reputation: 13
I was looking for a solution to a similar problem and came across this answer. I slightly modified the procedure to make it more universal - it also worked on charts with different aesthetics and facets.
First, I load the necessary packages and generate the dummy data.
library(reprex)
library(tidyverse)
set.seed(118)
data1 <- tibble(a = rnorm(100, mean = 20, sd = 3),
ID = "a",
gr = "A") %>%
bind_rows(tibble(a = rnorm(100, mean = 30, sd = 3),
ID = "z",
gr = "A")) %>%
bind_rows(tibble(a = rnorm(100, mean = 25, sd = 3),
ID = "a",
gr = "B")) %>%
bind_rows(tibble(a = rnorm(100, mean = 40, sd = 3),
ID = "z",
gr = "B")) %>%
mutate(CI_low = a - 3,
CI_high = a + 3)
Then I generate a chart:
data1 %>%
ggplot(aes(x = a, color = gr))+
geom_step(stat = "ecdf")+
facet_wrap(~ ID)
After generating the chart, you can preview its data, along with the calculated ecdf:
ggplot_build(last_plot())$data[[1]] %>%
as_tibble()
#> # A tibble: 408 × 10
#> colour y x ecdf flipped_aes PANEL group linewidth linetype alpha
#> <chr> <dbl> <dbl> <dbl> <lgl> <fct> <int> <dbl> <dbl> <lgl>
#> 1 #F8766D 0 -Inf 0 FALSE 1 1 0.5 1 NA
#> 2 #F8766D 0.09 15.0 0.09 FALSE 1 1 0.5 1 NA
#> 3 #F8766D 0.55 20.5 0.55 FALSE 1 1 0.5 1 NA
#> 4 #F8766D 0.51 20.0 0.51 FALSE 1 1 0.5 1 NA
#> 5 #F8766D 0.6 20.9 0.6 FALSE 1 1 0.5 1 NA
#> 6 #F8766D 0.08 14.9 0.08 FALSE 1 1 0.5 1 NA
#> 7 #F8766D 0.19 16.9 0.19 FALSE 1 1 0.5 1 NA
#> 8 #F8766D 0.47 19.7 0.47 FALSE 1 1 0.5 1 NA
#> 9 #F8766D 0.73 22.0 0.73 FALSE 1 1 0.5 1 NA
#> 10 #F8766D 0.34 18.5 0.34 FALSE 1 1 0.5 1 NA
#> # ℹ 398 more rows
The next step is to assign the values of the grouping variables to individual aesthetics (in my case, color) and groups of variables (facets' names). This must be done manually, paying special attention to ensure that the assignment is correct.
graph_data <- ggplot_build(last_plot())$data[[1]] %>%
as_tibble() %>%
mutate(gr = case_match(group,
1 ~ "A",
2 ~ "B")) %>%
mutate(ID = case_match(PANEL,
"1" ~ "a",
"2" ~ "z")) %>%
select(a = x, ecdf, gr, ID)
In the next step, I combine the data with the original data and generate a new chart:
data1 %>%
left_join(graph_data) %>%
ggplot(aes(x = a, color = gr))+
geom_ribbon(aes(xmin = CI_low, xmax = CI_high, y = ecdf, fill = gr, color = NULL), alpha = 0.1)+
geom_step(stat = "ecdf")+
facet_wrap(~ ID)
#> Joining with `by = join_by(a, ID, gr)`
Created on 2024-07-25 with reprex v2.1.1
Upvotes: 0
Reputation: 8691
Not sure exactly how you want to reflect the CI, but ggplot_build()
lets you get the generated data back from the plot, you can then overplot what you like.
This chart shows:
g<-ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.5, fill="red")
inside<-ggplot_build(g)
matched<-merge(inside$data[[1]],data.frame(x=dat$variable,dat$lower,dat$upper),by=("x"))
g +
geom_ribbon(data=matched, aes(x = x,
ymin = y + dat.upper-x,
ymax = y - x + dat.lower),
alpha=0.5, fill="blue") +
geom_ribbon(data=matched, aes(x = x,
ymin = ecdf(dat.lower)(x),
ymax = ecdf(dat.upper)(x)),
alpha=0.5, fill="green")
Upvotes: 2
Reputation: 173517
Try this (a bit of shot in the dark):
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(x = variable,ymin = ..y..-2,ymax = ..y..+2), stat = "ecdf",alpha=0.2) +
geom_step(stat = "ecdf")
Ok, so that's not the same thing as what you trying to do, but it should explain what's going on. The stat
is returning a data frame with just the original x and the computed y, so I think that's all you have to work with. i.e. stat_ecdf
only computes the cumulative distribution function for a single x at a time.
The only other thing I can think of is the obvious, calculating the lower and upper separately, something like this:
l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)
dat$lower1 <- l(dat$variable)
dat$upper1 <- u(dat$variable)
dat$variable1 <- v(dat$variable)
ggplot(dat,aes(x = variable)) +
geom_step(aes(y = variable1)) +
geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)
Upvotes: 3