A histogram with individual proportions on one Y-axis and cumulative proportion on another

Question

I am looking for a solution in [R] to make a chart like the one I show here (it was made in Excel):

I can make a histogram using the code below:

ggplot(data = TestData, aes(x = QP1)) +
  geom_histogram(aes(y = (..count..) / sum(..count..)),
                 binwidth = 0.1, fill = "lightblue", color = "black") +
  scale_x_continuous(breaks = seq(8, 10, 0.1)) +
  # percent_format() comes from the scales package
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(0, 1, 0.05)) +
  xlab("QP1")

but I could not make a secondary axis and a line plot overlaid on the histogram. I found several examples on this site asking similar questions, but still had difficulty truly understanding those solutions.

I need help for:

Recommended code to achieve the cumulative line plot; I would appreciate some guidance explaining how it works.
I want to control the bin width and add statistics such as mean and standard deviation to the chart. I don't mind using dplyr or doing some additional work.

Thanks.

Edit: I achieved what I initially wanted. While enhancing the chart further, adding data labels has become a challenge:

ggplot(data = TestData, aes(x = QP1, y = after_stat(count / sum(count)))) +
  geom_histogram(fill = "darkorange", color = "black", binwidth = 0.1) +
  stat_bin(aes(y = after_stat(cumsum(count / sum(count)) * 0.5)),
           geom = "line", colour = "dodgerblue", binwidth = 0.1) +
  stat_bin(aes(label = after_stat(scales::percent(count / sum(count)))),
           geom = "text", colour = "blue", binwidth = 0.1, vjust = 1) +
  stat_bin(aes(label = after_stat(scales::percent(cumsum(count / sum(count))))),
           geom = "text", colour = "blue", binwidth = 0.1, vjust = -4) +
  scale_y_continuous(
    labels = scales::percent, breaks = seq(0, 5, 0.1),
    name = "Proportion",
    sec.axis = sec_axis(~ .x * 2,
                        name = "Cumulative Proportion",
                        labels = scales::percent, breaks = seq(0, 1, 0.2)))

The data labels are added and show the correct numbers, but the cumulative labels need to be positioned according to sec.axis. How can I do that? If I transform the value by adding or dividing, the label value changes, not its position. Please suggest.

teunbrand · Accepted Answer

So a couple of things about secondary axes:

(1) You must transform input data yourself to fit it on the primary axis.
(2) You must give the inverse transform to the trans argument of the secondary axis.

In the code below we achieve (1) by setting y = after_stat(cumsum(count / sum(count)) * 0.1). The after_stat() part replaces the older ..variable.. syntax. The cumsum() calculates the cumulative sum of the proportions, giving the cumulative proportions. The * 0.1 divides the cumulative data by 10, which achieves (1). Then, to achieve (2), you should give the secondary axis ~ .x * 10 to scale the numbers on the axis itself back up. You can change these scaling factors depending on the plot, but be sure to change them in both places.

library(ggplot2)

df <- data.frame(
  x = rnorm(100)
)

ggplot(df, aes(x)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 fill = "darkorange")  +
  stat_bin(aes(y = after_stat(cumsum(count / sum(count)) * 0.1)),
           geom = "line", colour = "dodgerblue") +
  # Set secondary axis in y scale
  scale_y_continuous(
    labels = scales::percent,
    name = "Proportion",
    sec.axis = sec_axis(~ .x * 10, 
                        name = "Cumulative Proportion",
                        labels = scales::percent)
  ) +
  # For pretty colours
  theme(
    axis.line.y.left = element_line(colour = "darkorange"),
    axis.text.y.left = element_text(colour = "darkorange"),
    axis.ticks.y.left = element_line(colour = "darkorange"),
    axis.title.y.left = element_text(colour = "darkorange"),
    axis.line.y.right = element_line(colour = "dodgerblue"),
    axis.text.y.right = element_text(colour = "dodgerblue"),
    axis.ticks.y.right = element_line(colour = "dodgerblue"),
    axis.title.y.right = element_text(colour = "dodgerblue")
  )
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2021-01-21 by the reprex package (v0.3.0)
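The arithmetic behind points (1) and (2) can be checked without plotting. The following is only a sketch in base R: the data, the seed and the name scale_factor are made up for illustration, and hist() uses base R's binning rule rather than stat_bin()'s, so the bins will not match the plot exactly.

```r
set.seed(42)  # hypothetical data, seeded only for reproducibility
x <- rnorm(100)

# Bin the data without drawing anything (base R binning, not ggplot2's)
counts <- hist(x, plot = FALSE)$counts

prop     <- counts / sum(counts)  # per-bin proportions (primary axis)
cum_prop <- cumsum(prop)          # cumulative proportions, ending at 1

scale_factor <- 0.1               # (1) squeeze the cumulative data onto the primary axis
plotted <- cum_prop * scale_factor

# (2) the secondary axis applies the inverse, recovering the original values
recovered <- plotted / scale_factor
all.equal(recovered, cum_prop)
```

The line is drawn at the values in plotted, while the secondary axis labels those positions with the values in recovered. That is why both factors have to change together.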

EDIT:

With regard to sec_axis(~ .x * 10, ...), this is called 'lambda syntax', where you create a one-sided formula (only the right-hand side is defined) that will be converted to a function by rlang::as_function(). The .x is a placeholder for the input data, so ~ .x * 10 can be read as function(x) {x * 10}. This notation is not accepted everywhere in R, but many tidyverse packages accept it at various points.
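As a small illustration of that conversion, the sketch below calls rlang::as_function() directly. The names times_ten, times_ten_fn and primary_breaks are made up for this example.

```r
library(rlang)

# One-sided formula converted to a function; .x stands for the first argument
times_ten <- as_function(~ .x * 10)

# The equivalent ordinary function
times_ten_fn <- function(x) x * 10

# Hypothetical positions on the primary axis
primary_breaks <- c(0, 0.025, 0.05, 0.1)

times_ten(primary_breaks)
identical(times_ten(primary_breaks), times_ten_fn(primary_breaks))
```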

The after_stat() function is the newer notation for ..variable.., such that after_stat(count / sum(count)) is the same as the (..count..) / sum(..count..) you use in your example. The difference is that you don't need to wrap every variable in ..'s, and it is generally more flexible. The after_stat() function causes whatever is inside it to be evaluated after the stat layer has computed the stats. The count variable is not an aesthetic you define; it is a computed variable that the stat layer produces, so we need after_stat() to do something with that variable.
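Because after_stat() is evaluated separately for each aesthetic, the position (y) and the text (label) of a label layer can be derived independently from the same count variable. The sketch below is one possible way to place cumulative labels along the scaled line. It uses made-up data rather than TestData, and the 0.5 and 2 factors and the vjust value are illustrative choices, not a tested recommendation for your data.

```r
library(ggplot2)

set.seed(1)  # made-up data, not the asker's TestData
df <- data.frame(x = rnorm(100))

p <- ggplot(df, aes(x)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth = 0.5, fill = "darkorange") +
  stat_bin(
    aes(
      # position: cumulative proportion squeezed onto the primary axis
      y = after_stat(cumsum(count / sum(count)) * 0.5),
      # text: the unscaled cumulative proportion
      label = after_stat(scales::percent(cumsum(count / sum(count))))
    ),
    geom = "text", binwidth = 0.5, vjust = -0.5, colour = "dodgerblue"
  ) +
  scale_y_continuous(
    labels = scales::percent,
    name = "Proportion",
    sec.axis = sec_axis(~ .x * 2, name = "Cumulative Proportion",
                        labels = scales::percent)
  )

# Inspect what the text layer computed: y is scaled, label is not
text_layer <- layer_data(p, 2)
text_layer[, c("x", "count", "y", "label")]
```

layer_data() returns the data frame a layer ends up with after the stat has run, which makes it a convenient way to confirm that the y values stop at 0.5 while the labels still read as unscaled percentages.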
