Tomas R

Reputation: 556

What's the right way to add text to geom_histogram in ggplot?

I've plotted a histogram with wage on the x-axis and a y-axis that shows the percentage of individuals in the data set who have that particular wage. Now I want each bar to display how many observations it contains. E.g., in the sample_data I've provided, how many wages are in the 10% bars and how many in the 20% bars?

Here's a small sample of my data:


sample_data <- structure(list(
  wage = c(81L, 77L, 63L, 84L, 110L, 151L, 59L, 109L, 159L, 71L),
  school = c(15L, 12L, 10L, 15L, 16L, 18L, 11L, 12L, 10L, 11L),
  expr = c(17L, 10L, 18L, 16L, 13L, 15L, 19L, 20L, 21L, 20L),
  public = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L),
  female = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
  industry = c(63L, 93L, 71L, 34L, 83L, 38L, 82L, 50L, 71L, 37L)),
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
  class = "data.frame")

Here's my R script

library(ggplot2)
library(dplyr)

ggplot(data = sample_data) +
  geom_histogram(aes(x = wage, y = stat(count) / sum(count)), binwidth = 4, color = "black") + 
  scale_x_continuous(breaks = seq(0, 300, by = 20)) +
  scale_y_continuous(labels = scales::percent_format()) 


I'm happy with this basically, but whatever I try -- I can't get text on top of my columns. Here is one example of many using stat_count that doesn't work:

ggplot(data = sample_data) +
  geom_histogram(aes(x = wage, y = stat(count) / sum(count)), binwidth = 4, color = "black") + 
  scale_x_continuous(breaks = seq(0, 300, by = 20)) +
  scale_y_continuous(labels = scales::percent_format()) + 
  stat_count(aes(y = ..count.., label =..count..), geom = "text")

I've also tried using geom_text, to no avail.

EDIT: ANSWER!

Many thanks to those who replied. I ended up using teunbrand's solution with a small modification: I changed after_stat(density) to after_stat(count) / sum(count).

Here's the 'final' code:

ggplot(sample_data) +
  geom_histogram(
    aes(x = wage,
        y = after_stat(count) / sum(count)),
    binwidth = 4, colour = "black"
  ) +
  stat_bin(
    aes(x = wage,
        y = after_stat(count) / sum(count),
        label = after_stat(ifelse(count == 0, "", count))),
    binwidth = 4, geom = "text", vjust = -1) + 
  scale_x_continuous(breaks = seq(0, 300, by = 20)) +
  scale_y_continuous(labels = scales::percent_format()) 
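The switch from after_stat(density) to after_stat(count) / sum(count) only rescales the bars: a bin's density is its proportion divided by the bin width. The sketch below checks that relationship with base R's hist() on the sample wages. The breaks used here are an assumption chosen for illustration and need not match the ones ggplot2 picks for binwidth = 4.

```r
# Wages from the question's sample_data
wage <- c(81, 77, 63, 84, 110, 151, 59, 109, 159, 71)

# Bins of width 4; these breaks are an illustrative assumption and are
# not necessarily the breaks ggplot2 computes for binwidth = 4
width <- 4
breaks <- seq(56, 160, by = width)
h <- hist(wage, breaks = breaks, plot = FALSE)

# Share of observations per bin, analogous to after_stat(count) / sum(count)
proportion <- h$counts / sum(h$counts)
# Density per bin, analogous to after_stat(density)
density <- h$density

# Density equals the proportion divided by the bin width
stopifnot(isTRUE(all.equal(density, proportion / width)))
# Proportions sum to 1, so the y-axis can be read as a percentage
stopifnot(isTRUE(all.equal(sum(proportion), 1)))
```

With the proportion on the y-axis, scales::percent_format() labels the axis as the share of observations per bin, which is what the question asked for.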

Upvotes: 2

Views: 2792

Answers (2)

teunbrand

Reputation: 38053

Different layers typically don't share stateful information, so you could use the same stat as the histogram (stat_bin()) to display the labels. Then, you can use after_stat() to use the computed variables of the stat part of the layer to make labels.

library(ggplot2)

sample_data<- structure(list(
  wage = c(81L, 77L, 63L, 84L, 110L, 151L, 59L, 109L, 159L, 71L), 
  school = c(15L, 12L, 10L, 15L, 16L, 18L, 11L, 12L, 10L, 11L), 
  expr = c(17L, 10L, 18L, 16L, 13L, 15L, 19L, 20L, 21L, 20L), 
  public = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L),
  female = c(1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), 
  industry = c(63L, 93L, 71L, 34L, 83L, 38L, 82L, 50L, 71L, 37L)), 
  row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"), 
  class = "data.frame")

ggplot(sample_data) +
  geom_histogram(
    aes(x = wage,
        y = after_stat(density)),
    binwidth = 4, colour = "black"
  ) +
  stat_bin(
    aes(x = wage,
        y = after_stat(density),
        label = after_stat(ifelse(count == 0, "", count))),
    binwidth = 4, geom = "text", vjust = -1
  )

Created on 2021-03-28 by the reprex package (v1.0.0)
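To see which computed variables after_stat() can reach, it may help to inspect the data a layer actually draws. The following is a minimal sketch, not part of the original answer, that uses ggplot2's layer_data() on a histogram built from the sample wages; the object names p and bins are made up for illustration.

```r
library(ggplot2)

# Wages from the question's sample_data
wage <- c(81L, 77L, 63L, 84L, 110L, 151L, 59L, 109L, 159L, 71L)

p <- ggplot(data.frame(wage = wage)) +
  geom_histogram(aes(x = wage, y = after_stat(density)), binwidth = 4)

# layer_data() returns the data frame behind the first layer,
# including the variables computed by stat_bin() such as count and density
bins <- layer_data(p, 1)

# Show only the non-empty bins and the columns relevant to the labels
bins[bins$count > 0, c("xmin", "xmax", "count", "density")]
```

Any column listed there that the stat computes, such as count or density, can be referenced inside after_stat() in the label aesthetic.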

Upvotes: 5

Oliver

Reputation: 8602

Personally, I find the existing answers on this topic somewhat frustrating; this is a problem I would expect to have a much simpler solution somewhere out there. I am not a fan of the 0's showing up in my histograms either, and positioning with stat_bin becomes frustrating at times. Having had to do this a couple of times, I usually revert to some manual calculations and use geom_rect in combination with geom_text/geom_label. Maybe some day I'll sit down and actually create the (I believe) 3 functions needed for a proper geom_*. Until then, the basic idea is:

  1. Create my histogram data using hist.
  2. Convert the data to a data.frame with the aesthetics needed for geom_rect (our "geom_hist" substitute) and geom_text.
  3. Plot manually with this data in the necessary layers.

#' Compute data for creating a manual histogram with ggplot including labels 
#'
#' @param bardata output from \code{hist(data, plot = FALSE)}
#' @param probs if TRUE (the default), labels show the raw bin counts; if FALSE, labels show the proportions
#' 
#' @return a \code{data.frame} with columns xmin, ymin, xmax, ymax, mids and label
create_gg_hist_df <- function(bardata, probs = TRUE){
  nb <- length(bardata$breaks)
  xmax <- bardata$breaks[-1L]
  xmin <- bardata$breaks[-nb]
  mids <- bardata$mids
  ymin <- integer(nb - 1)
  # hist() names this element "counts"; "$count" only works via partial matching
  ymax <- bardata$counts / sum(bardata$counts)
  label <- if(!probs) ymax else bardata$counts
  data.frame(xmin = xmin,
             ymin = ymin,
             xmax = xmax, 
             ymax = ymax, 
             mids = mids, 
             label = label)
}
ggbardata <- create_gg_hist_df(hist(sample_data$wage, 
                                    # breaks based on ggplot2 when "width" is supplied
                                    breaks = ggplot2:::bin_breaks_width(range(sample_data$wage), 
                                                                        width = 4)$breaks, 
                                    plot = FALSE))

# Requires ggplot2 and dplyr (for %>% and filter()), as loaded in the question
ggbardata %>% 
  # Remove "0" columns (I don't want them; that is my preference)
  filter(ymax > 0) %>% 
  ggplot(aes(xmin = xmin, xmax = xmax, 
             ymin = ymin, ymax = ymax, 
             label = label)) + 
  # Add histogram
  geom_rect(color = 'black') + 
  # Add text
  geom_text(aes(x = mids, y = ymax), nudge_y = 0.005) + 
  scale_y_continuous(labels = scales::percent_format()) + 
  labs(x = 'wage', y = 'frequency')


Upvotes: 4
