Ujjawal Bhandari
Ujjawal Bhandari

Reputation: 1372

Pick data labels from different variable

I am creating a waterfall chart. I want to show data labels above bars. My current code is showing data labels at positions based on variable 'value'. Position of labels should be based on column 'end'. Value of labels should be based on variable 'value'.

require(tidyverse)

df <- tibble(
  label = c("Forecast YTD", "Gross Accelerated Cashflows", "Gross Delayed Cashflows", 
            "Gross Asset Outperformance", "Gross Asset Underperformance"),
  value = c(0.664, 0.554, -0.515, 0.332, -0.034)) 

df <- df %>% 
  mutate(
    end = cumsum(value),
    start = dplyr::lag(end, default = 0),
    col = c("neg", "pos", "total")[c(3, (value[-1]>0)+1)],
    id = seq(1, n())
  ) %>% 
  bind_rows(summarise(., label = "Total Actual Collections", start = 0, 
                      end = last(end), id = n()+1L, col = "total"))


  df %>% ggplot(aes(x=label, y=value, fill = col)) + 
  geom_rect(aes(x = label, xmin = id -0.45, xmax = id + 0.45, ymin = end, ymax = start)) + 
  ylab("% Gross Cash Forecasted YTD") + xlab("") + 
  geom_text(aes(label = scales::percent(value, accuracy = 1)), vjust = -0.5)  +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("#bfbfbf", "#90a9e3", "#002060"))

Upvotes: 1

Views: 164

Answers (1)

jdobres
jdobres

Reputation: 11957

You're very close! One issue is that you are setting the x-position of the percentage labels based on "label" (geom_text inherits this from the main call to ggplot), whereas the x-position of your bars is calculated from the "id" column. Since ggplot arranges the x-axis values alphabetically based on label, this throws your axes out of sync with the values in "id". One way around this would be to convert "label" to an ordered factor. ggplot will then pick up the order data, and both your bars and labels will be in sync.

The other issue is that the y-value for your labels should be set to the maximum value of either "start" or "end" for each column, whichever is greater. The following code resolves both issues (commented lines):

df <- tibble(
  label = c("Forecast YTD", "Gross Accelerated Cashflows", "Gross Delayed Cashflows", 
            "Gross Asset Outperformance", "Gross Asset Underperformance"),
  value = c(0.664, 0.554, -0.515, 0.332, -0.034)) 

df <- df %>% 
  mutate(
    end = cumsum(value),
    start = dplyr::lag(end, default = 0),
    col = c("neg", "pos", "total")[c(3, (value[-1]>0)+1)],
    id = seq(1, n()),
    max = pmax(start, end, na.rm = T) # the greater of "start" or "end" for each row
  ) %>%
  bind_rows(summarise(., label = "Total Actual Collections", start = 0, 
                      end = last(end), id = n()+1L, col = "total")) %>% 
  mutate(label = factor(label, label, ordered = T)) # convert labels to ordered factor

Some minor changes to your plot code:

df %>% ggplot(aes(x=label, y=value, fill = col)) + 
  geom_rect(aes(x = label, xmin = id -0.45, xmax = id + 0.45, ymin = end, ymax = start)) + 
  ylab("% Gross Cash Forecasted YTD") + xlab("") + 
  geom_text(aes(label = scales::percent(value, accuracy = 1), y = max), vjust = -0.5)  +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("#bfbfbf", "#90a9e3", "#002060"))

enter image description here

Upvotes: 1

Related Questions