Katherine Chau

Reputation: 443

How to colour histogram according to quantile range?

I want to colour my histogram plot to show the quantile range distribution but I am unable to figure out where/how to add the colours into my code.

Here is my sample data and code (using iris dataset).

library(ggplot2)

# Get summary info and quantile range info for data
summary(iris)
vals<- c(q1=quantile(iris$Sepal.Length, .25),
          q2=quantile(iris$Sepal.Length, .50),
          q3=quantile(iris$Sepal.Length, .75))

# Create histogram and colour bins according to quantile range; add mean/med lines
h <- ggplot(iris, aes(Sepal.Length), fill=vals) +
  geom_histogram(color="black", bins=10) +
  scale_x_continuous(breaks=seq(0, 10, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(0,60)) + theme_bw() +
  labs(x="Length", y="Frequency") +
  theme(axis.title.x = element_text(size=12),
        axis.text.x = element_text(size=12),
        axis.title.y = element_text(size=12),
        axis.text.y = element_text(size=12))

h + geom_vline(xintercept = median(iris$Sepal.Length),col="red", size=1.1, lty="solid") +
  geom_vline(xintercept = mean(iris$Sepal.Length),col="blue", size=1.1, lty="solid")

I've tried setting "fill" in different places, both in ggplot(aes()) and in geom_histogram(), but the resulting figure always comes out dark grey or the code fails.

I just want to colour the bins to show the quantile range, so 4 colours on the figure total.

UPDATE: From the two answers posted below, it is clear that showing quantile colours on a histogram is not the best approach. I may just stick to showing lines instead, but I thank both users for their posts. Both worked well for me, and I will probably use this code for regular ggplots rather than specifically for histograms.

Upvotes: 1

Views: 689

Answers (2)

Allan Cameron

Reputation: 174476

This entire concept seems problematic. Usually histograms have equally spaced bars of a fixed width, but quantiles are typically not evenly spaced. It is therefore not possible in the general case to align fixed-width histogram breaks to quantiles in a way that would allow for quantile coloring.
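As a quick illustration of the uneven spacing, the following minimal sketch prints the quartile boundaries of Sepal.Length and the width of each quartile range. The variable names `q` and `gaps` are chosen here for illustration and are not part of the original answer.

```r
# Quartile boundaries of Sepal.Length (min, Q1, median, Q3, max)
q <- quantile(iris$Sepal.Length, probs = seq(0, 1, 0.25))
print(q)

# Width of each quartile range. Fixed-width bins could only line up
# with the quartile boundaries if these widths were compatible.
gaps <- diff(unname(q))
print(gaps)
```

For iris the four ranges are roughly 0.8, 0.7, 0.6 and 1.5 wide, so no single bin width places a bin edge on every quartile.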

It's fairly easy if you are prepared to have a flexible binwidth, but then you would need density rather than counts on the y axis to ensure the areas remain proportional:

ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(breaks = quantile(iris$Sepal.Length, seq(0, 1, 0.125)),
                 aes(fill = cut(Sepal.Length, labels = paste0("Q", 1:4),
                                quantile(Sepal.Length, seq(0, 1, 0.25)),
                                include.lowest = TRUE), 
                     y = after_stat(density)), color = "black") +
  scale_fill_viridis_d("Quantile") +
  labs(x = "Length") +
  theme_bw(base_size = 16) 
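To see why density is the appropriate y axis for unequal bins, here is a small base-R sketch (not part of the plot code above) that rebuilds the same octile breaks with hist() and checks that bar area, not bar height, carries the share of observations. The names `brks`, `h`, `widths` and `areas` are illustrative, and `unique()` is only a precaution in case tied values produce duplicate breaks in other data.

```r
x <- iris$Sepal.Length

# Same octile breaks as in the plot; unique() guards against duplicate breaks
brks <- unique(quantile(x, probs = seq(0, 1, 0.125)))

# plot = FALSE returns the bin statistics without drawing anything
h <- hist(x, breaks = brks, plot = FALSE)

widths <- diff(h$breaks)
areas  <- h$density * widths   # area of each bar = share of observations in it

data.frame(left = head(h$breaks, -1), right = h$breaks[-1],
           count = h$counts, density = h$density, area = areas)

sum(areas)   # the bar areas add up to 1
```

Because ties in the data keep the counts from being exactly equal, the individual areas are only approximately 1/8 each.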


An alternative is to split the colors of each bar. This requires considerable precalculation. I have wrapped the calculation in the following function:

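library(dplyr)  # ggplot2 must also be loaded

quant_hist <- function(x, bins = 15) {
  # plot = FALSE: only the break points are needed, not a base-graphics plot
  breaks <- hist(x, breaks = bins, plot = FALSE)$breaks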
  bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
  quants <- cut(x, quantile(x, seq(0, 1, 0.25)), labels = paste0("Q", 1:4),
                include.lowest = TRUE)
  df <- data.frame(x, bins = bins, q = quants) %>%
    group_by(bins, q) %>%
    count() %>%
    ungroup() %>%
    mutate(ymax = sum(n), .by = bins) %>%
    mutate(left = sapply(strsplit(as.character(bins), ","),
                         \(x) as.numeric(substr(x[1], 2, 100))),
           right = sapply(strsplit(as.character(bins), ","),
                          \(x) as.numeric(substr(x[2], 1, nchar(x[2]) - 1)))) |>
    group_by(bins) |>
    mutate(xmin = if(n() == 1) left else 
      left + c(0, first(n)/first(ymax) * (first(right) - first(left))),
      xmax = if(n() == 1) right else c(last(xmin), first(right)), ymin = 0) %>%
    ungroup() %>%
    select(xmin, xmax, ymin, ymax, q, bins)
  
  ggplot(df, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
    geom_rect(aes(fill = q)) +
    geom_rect(data = df %>% summarize(xmin = min(xmin), xmax = max(xmax),
                                     ymin = 0, ymax = first(ymax), .by = bins),
              fill = NA, color = "black")
}
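The reason bars need to be split at all can be checked with a cross-tabulation of histogram bins against quartile groups. This is a minimal sketch, separate from quant_hist(), and the names `bin`, `qgrp`, `tab` and `mixed` are illustrative only.

```r
x <- iris$Sepal.Length

# Fixed-width bins as chosen by hist(), without drawing a plot
breaks <- hist(x, breaks = 8, plot = FALSE)$breaks
bin    <- cut(x, breaks = breaks, include.lowest = TRUE)

# Quartile group of each observation
qgrp <- cut(x, quantile(x, seq(0, 1, 0.25)), labels = paste0("Q", 1:4),
            include.lowest = TRUE)

# Rows are bins, columns are quartile groups
tab <- table(bin, qgrp)
print(tab)

# Bins that contain observations from more than one quartile group
mixed <- rowSums(tab > 0) > 1
print(names(mixed)[mixed])
```

Any bin listed at the end straddles a quartile boundary, so it cannot be given a single fill colour without misrepresenting part of its contents.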

This allows the shape of the histogram and its quantile fill to be completely separate, for example:

quant_hist(iris$Sepal.Length, bins = 8) +
  scale_fill_viridis_d("Quantile") +
  labs(x = "Length") +
  theme_bw(base_size = 16) 


And

quant_hist(iris$Sepal.Length, bins = 16) +
  scale_fill_brewer("Quantile", palette = "Spectral") +
  labs(x = "Length") +
  theme_minimal(base_size = 16) 


Upvotes: 3

a11

Reputation: 3396

As @Allan Cameron said, a histogram isn't the best tool for this, because you would likely need unequal bin widths to put the bin boundaries exactly at the quantiles. For example, plot the whole Sepal.Length variable and you will see that the quantiles fall inside bins, not at bin edges. If you try a range of binwidths, you will find that for this dataset (and most datasets) the quantiles do not fall at the edges.

# Calculate quantiles
quantiles <- quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75))

# Create a data frame to help with labeling the quantile lines
quantile_df <- data.frame(
  quantile = quantiles,
  label = c("25th percentile", "50th percentile", "75th percentile")
)

# Plot histogram of all data
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.3, fill = "skyblue", color = "black") +
  geom_vline(data = quantile_df, aes(xintercept = quantile, linetype = label), color = "red", linewidth = 1) +
  scale_linetype_manual(values = c("dotted", "solid", "dashed")) +
  labs(title = "Histogram of Iris Sepal Length",
       x = "Sepal Length (cm)",
       y = "Count") +
  theme_minimal() +
  guides(linetype = guide_legend(title = "Quantiles"))


If you still want to color-code by quantile, you can use cut to assign each observation to a range and then plot:

# Calculate quantiles with bounding values
quantiles2 <- quantile(iris$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))

# Assign each observation to a quantile range
iris$Sepal.Length.Quantile <- cut(iris$Sepal.Length, breaks = quantiles2, 
                            labels = c("0-25%", "25-50%", "50-75%", "75-100%"),
                            include.lowest = TRUE)

# Plot the histogram
ggplot(iris, aes(x = Sepal.Length, fill = Sepal.Length.Quantile)) +
  geom_histogram(position = "stack", alpha = 0.6, binwidth = 0.3, color = "black") +
  scale_fill_brewer(palette = "Dark2", name = "Quantile Range") +
  labs(title = "Histogram of Iris Sepal Length",
       x = "Sepal Length (cm)",
       y = "Count") +
  theme_minimal()


It is unusual to present data like this. The closest thing I can think of to what you are asking for is to color-code the estimated probability density (PDF) of the data, with a histogram displayed in the background. However, this is still an odd way to present data.

# Estimate density
dens <- density(iris$Sepal.Length)
dd <- with(dens, data.frame(x, y))

# Calculate quantiles with bounding values
quantiles2 <- quantile(iris$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))

# Assign each point of the density grid to a quantile range
dd$quantile_range <- with(dd, cut(x, breaks = quantiles2, 
                                  labels = c("0-25%", "25-50%", "50-75%", "75-100%"),
                                  include.lowest = TRUE))


# Plotting
ggplot() +
  geom_histogram(data = iris, aes(x = Sepal.Length, y = after_stat(density)), 
                 binwidth = 0.3, fill = "grey", alpha = 0.25, color = "black") +
  geom_line(data = dd, aes(x = x, y = y), linewidth = 1.5) +
  geom_ribbon(data = dd, aes(x = x, ymax = y, ymin = 0, fill = quantile_range), alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", name = "Data Quantile Range") +
  labs(title = "Distribution of Iris Sepal Length with Quantile Ranges",
       x = "Sepal Length (cm)", y = "Probability Density") +
  theme_minimal()
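One detail worth checking in this approach: by default density() evaluates the curve on a grid that extends three bandwidths beyond the data (its `cut` argument), so grid points in the tails lie outside the quantile breaks and are labelled NA by cut(). The sketch below counts those points and, as one possible way of handling them, drops them before plotting. The names `n_outside` and `dd_inside` are illustrative and not part of the code above.

```r
x <- iris$Sepal.Length

# Density estimate on the default grid, which extends past min(x) and max(x)
dens <- density(x)
dd   <- data.frame(x = dens$x, y = dens$y)

qs <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
dd$quantile_range <- cut(dd$x, breaks = qs,
                         labels = c("0-25%", "25-50%", "50-75%", "75-100%"),
                         include.lowest = TRUE)

# Grid points in the tails fall outside [min(x), max(x)] and become NA
n_outside <- sum(is.na(dd$quantile_range))
print(n_outside)

# Optional: keep only the grid points that belong to a quantile range
dd_inside <- dd[!is.na(dd$quantile_range), ]
range(dd_inside$x)
```

Passing `dd_inside` to geom_ribbon() instead of `dd` would colour only the part of the curve that lies within the observed data. Whether to keep or drop the tails is a presentation choice.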


Upvotes: 3
