user14692575

Illustrate standard deviation in histogram

Consider the following simple example:

# E. Musk in Grunheide 
set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

# empirical sd 
sd(randomNumbers)
#> [1] 10.34369

# histogram 
hist(randomNumbers, probability = TRUE, main = "", breaks = 50)

# just for illustration purposes
###
# empirical density 
lines(density(randomNumbers), col = 'black', lwd = 2)
# theoretical density
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
###

Created on 2022-03-22 by the reprex package (v2.0.1)

Question: Is there a nice way to illustrate the empirical standard deviation (sd) in the histogram by colour? E.g. representing the inner bars by a different color, or indicating the range of the sd by an interval, i.e., [mean +/- sd], on the x-axis?

Note: if ggplot2 provides an easy solution, a suggestion along those lines would also be much appreciated.

Upvotes: 1

Views: 3029

Answers (3)

Allan Cameron

Reputation: 173803

This is a ggplot solution similar to benson23's answer, except that the histogram is precomputed and drawn with geom_col, which avoids the unwelcome stacking of bars at the sd boundary:

# E. Musk in Grunheide 
set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n=1000, mean=10, sd=10)

h <- hist(randomNumbers, breaks = 50, plot = FALSE)

lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)

df <- data.frame(x = h$mids, y = h$density, 
                 fill = h$mids > lower & h$mids < upper)

library(ggplot2)

ggplot(df) +
  geom_col(aes(x, y, fill = fill), width = 1, color = 'black') +
  geom_density(data = data.frame(x = randomNumbers), 
               aes(x = x, color = 'Actual density'),
               key_glyph = 'path') +
  # Normal curve using the sample mean and sd (the true parameters are mean = 10, sd = 10)
  geom_function(fun = function(x) {
    dnorm(x, mean = mean(randomNumbers), sd = sd(randomNumbers)) },
    aes(color = 'theoretical density')) +
  scale_fill_manual(values = c(`TRUE` = '#FF374A', 'FALSE' = 'gray'), 
                    name = 'within 1 SD') +
  scale_color_manual(values = c('black', 'blue'), name = 'Density lines') +
  labs(x = 'Value of random number', y = 'Density') +
  theme_minimal()
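
Since the question starts from base graphics, the same precomputed-histogram idea can probably be carried over without ggplot2. The following is only a sketch under that assumption: `bar_cols` and `y_mark` are names I made up for illustration, the colours are arbitrary, and the bracket height is just a guess at what looks reasonable.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

# Precompute the histogram without drawing it
h <- hist(randomNumbers, breaks = 50, plot = FALSE)

lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)

# Colour a bar when its midpoint lies within one sd of the mean
bar_cols <- ifelse(h$mids > lower & h$mids < upper, "#FF374A", "gray")

# plot() on a histogram object accepts one colour per bar
plot(h, freq = FALSE, col = bar_cols, main = "",
     xlab = "Value of random number")

# Mark the interval [mean - sd, mean + sd] with dashed lines and a bracket
abline(v = c(lower, upper), lty = 2)
y_mark <- max(h$density) * 0.05   # hypothetical height, adjust to taste
arrows(lower, y_mark, upper, y_mark,
       code = 3, angle = 90, length = 0.05, lwd = 2)
```

The bracket drawn by arrows() gives the [mean +/- sd] interval the question asks about, while the bar colours carry the same information as the fill in the ggplot version.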

Upvotes: 5

benson23

Reputation: 19097

Here is a ggplot solution. First calculate the mean and sd and save them in separate variables. Then use ifelse() to categorise each value as "Within range" or "Outside range", and map that category to the fill colour.

The blue line is the normal distribution stated in your question (scaled to counts), and the black line is the kernel density estimate of the data being plotted.

library(ggplot2)

set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n=1000, mean=10, sd=10)

randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)

ggplot(data.frame(randomNumbers = randomNumbers), aes(randomNumbers)) +
  geom_histogram(aes(
    fill = ifelse(
      randomNumbers > randomNumbers_mean + randomNumbers_sd |
        randomNumbers < randomNumbers_mean - randomNumbers_sd,
      "Outside range",
      "Within range"
    )
  ), 
  binwidth = 1, col = "gray") +
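  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_density(aes(y = after_stat(count))) + 
  # multiply by n = 1000 so the curve is on the count scale (binwidth is 1)
  stat_function(fun = function(x) dnorm(x, mean = 10, sd = 10) * 1000,
                color = "blue") +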
  labs(fill = "Data")

Created on 2022-03-22 by the reprex package (v2.0.1)
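
As a rough sanity check on the ifelse() classification, one can tabulate the two categories. This is a sketch that assumes the same seed and data as above; `range_label` is a name introduced here only for illustration. For normally distributed data, about 68% of the values should land in "Within range", since pnorm(1) - pnorm(-1) is roughly 0.683.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)

# Same condition as in the fill aesthetic of the plot
range_label <- ifelse(
  randomNumbers > randomNumbers_mean + randomNumbers_sd |
    randomNumbers < randomNumbers_mean - randomNumbers_sd,
  "Outside range",
  "Within range"
)

# Counts and proportions per category
table(range_label)
prop.table(table(range_label))

# Share expected under an exact normal distribution, for comparison
pnorm(1) - pnorm(-1)
```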

Upvotes: 3

Stefano Barbi

Reputation: 3194

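library(ggplot2)

# randomNumbers is the vector generated in the question; |> needs R >= 4.1
data.frame(rand = randomNumbers,
           cut = {
             s <- sd(randomNumbers)   # named s to avoid masking sd()
             mn <- mean(randomNumbers)
             # three bands: below mean - sd, within one sd, above mean + sd
             cut(randomNumbers, c(-Inf, mn - s, mn + s, Inf))
           }) |>
  ggplot(aes(x = rand, fill = cut)) +
  geom_histogram()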
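
With the default bins, a bar that straddles mean - sd or mean + sd is drawn in two stacked colours. One possible way around that, sketched here under my own assumptions, is to choose bin edges so that both boundaries fall exactly on an edge. The names `s`, `bin_breaks`, `band`, `k_min` and `k_max` and the bin width of sd / 5 are illustrative choices, not part of the code above.

```r
library(ggplot2)

set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

mn <- mean(randomNumbers)
s  <- sd(randomNumbers)

# Bin edges at mn + k * s / 5 for integer k, so that
# mn - s (k = -5) and mn + s (k = 5) are both edges
k_min <- floor((min(randomNumbers) - mn) / (s / 5))
k_max <- ceiling((max(randomNumbers) - mn) / (s / 5))
bin_breaks <- mn + (k_min:k_max) * s / 5

df <- data.frame(
  rand = randomNumbers,
  band = cut(randomNumbers, c(-Inf, mn - s, mn + s, Inf),
             labels = c("below mean - sd", "within 1 sd", "above mean + sd"))
)

# Every bin now lies entirely inside one band, so no bar is split
p <- ggplot(df, aes(x = rand, fill = band)) +
  geom_histogram(breaks = bin_breaks, color = "black")
print(p)
```

Both cut() and geom_histogram() use right-closed intervals by default, so a value is assigned to the same side of the boundary in both steps.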

Upvotes: 2
