Stefan Klocke

Reputation: 139

Preventing wrong density plots when coloring histograms according to groups

Based on some dummy data, I created a histogram with a density plot:

library(ggplot2)
set.seed(1234)
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each=200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)
a <- ggplot(wdata, aes(x = weight))

a + geom_histogram(aes(y = ..density..,
                       # color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               # aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Basic Result

The histogram of weight should be colored according to sex, so I use aes(y = ..density.., color = sex) for geom_histogram():

a + geom_histogram(aes(y = ..density..,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               # aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Scaled individual histograms (not desired)

As I want, the density plot stays the same (one overall curve for both groups), but the histogram bars are scaled up (and now seem to be computed for each group individually):

How do I prevent this from happening? I need individually colored histogram bars but a joint density plot for all coloring groups.
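The jump in scale is consistent with the density being normalised within each group once a grouping aesthetic such as color = sex is mapped. The following is only a base R sketch of that arithmetic, not of ggplot2's internals; the breaks are chosen here for illustration and will not match ggplot2's default bins.

```r
set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

# Illustrative bin width and breaks (assumptions, not ggplot2's defaults)
binwidth <- 0.25
breaks <- seq(floor(min(wdata$weight)), ceiling(max(wdata$weight)),
              by = binwidth)

# Bin counts per group: one column per level of 'sex'
counts <- sapply(split(wdata$weight, wdata$sex),
                 function(w) hist(w, breaks = breaks, plot = FALSE)$counts)

# Per-group density: every column is divided by its own group size
per_group <- sweep(counts, 2, colSums(counts), "/") / binwidth

# Joint density: every count is divided by the total sample size
joint <- counts / (sum(counts) * binwidth)

colSums(per_group) * binwidth  # each group integrates to 1 on its own
sum(joint) * binwidth          # both groups together integrate to 1
```

With two groups of equal size, the per-group bars are therefore twice as high as the bars of the joint histogram, while the single density curve is still computed on all 400 observations.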

P.S. Using aes(color = sex) for geom_density() gets everything back to original scales - but I don't want individual density plots (like below):

a + geom_histogram(aes(y = ..density..,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Individual densities (not desired)

EDIT:

As has been suggested, dividing by the number of groups in geom_histogram()'s aesthetics with y = ..density../2 may approximate the solution. However, this only works with symmetric distributions like in the first output below:

a + geom_histogram(aes(y = ..density../2,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

which yields

Solution

Less symmetric distributions, however, may cause trouble with this approach. See the plots below, where y = ..density../5 was used for 5 groups. First the original, then the manipulated version (with position = "stack"):

Original

Divided by 5

Since the distribution is heavy on the left, dividing by 5 underestimates on the left and overestimates on the right.
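One way to see where the distortion comes from: dividing the per-group density by the number of groups k gives count / (n_g * binwidth * k), whereas the joint density is count / (N * binwidth). The two agree only when every group holds exactly N / k observations. The sketch below uses hypothetical, deliberately unequal group sizes (300 and 100) to show the resulting factor; the data and names are made up for illustration.

```r
set.seed(1234)

# Hypothetical unequal groups: 300 observations in "a", 100 in "b"
g <- factor(rep(c("a", "b"), times = c(300, 100)))
x <- c(rnorm(300, 55), rnorm(100, 58))

binwidth <- 0.5
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
counts <- sapply(split(x, g),
                 function(v) hist(v, breaks = breaks, plot = FALSE)$counts)

k   <- nlevels(g)       # number of groups
n_g <- colSums(counts)  # observations per group
N   <- sum(counts)      # total observations

# What y = ..density.. / k amounts to: per-group density divided by k
divided <- sweep(counts, 2, n_g * binwidth * k, "/")

# Bar heights that would match the joint density
joint <- counts / (N * binwidth)

# Factor by which each group's bars are off: N / (k * n_g)
ratio <- N / (k * n_g)
ratio  # a: 0.667 (too low), b: 2 (too high)
```

The total area is still 1 in both versions, so the error shows up as a distorted shape rather than a wrong overall scale: bars of large groups come out too low and bars of small groups too high.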

EDIT 2: SOLUTION

As suggested by Andrew, the below (complete) code solves the problem:

library(ggplot2)
set.seed(1234)
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

binwidth <- 0.25
a <- ggplot(wdata,
            aes(x = weight,
                # Pass binwidth to aes() so it will be found in
                # geom_histogram()'s aes() later
                binwidth = binwidth))

# Basic plot w/o colouring according to 'sex'
a + geom_histogram(aes(y = ..density..),
                   binwidth = binwidth,
                   colour = "black",
                   fill = "white",
                   position = "stack") +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
  # Use fixed scale for sake of comparability
  scale_x_continuous(limits = c(52, 61)) +
  scale_y_continuous(limits = c(0, 0.25))


# Plot w/ colouring according to 'sex'
a + geom_histogram(aes(x = weight,
                       # binwidth will only be found if passed to
                       # ggplot()'s aes() (as above)
                       y = ..count.. / (sum(..count..) * binwidth),
                       color = sex),
                   binwidth = binwidth,
                   fill="white",
                   position = "stack") +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
  # Use fixed scale for sake of comparability
  scale_x_continuous(limits = c(52, 61)) +
  scale_y_continuous(limits = c(0, 0.25)) +
  guides(color = "none")

Note: I had to pass binwidth = binwidth to ggplot()'s aes(); otherwise the pre-specified binwidth was not found by geom_histogram()'s aes(). In addition, position = "stack" is specified so that both versions of the histogram are comparable. Plots for the dummy data and for the more complex distribution are shown below:

Correct, ungrouped, simple data

Correct, grouped, simple data

Correct, ungrouped, more complex distribution

Correct, grouped, more complex distribution

Solved - Thanks for your help!

Upvotes: 3

Views: 711

Answers (1)

Andrew Gustar

Reputation: 18425

I don't think you can do it using y=..density.., but you can recreate the same thing like this...

binwidth <- 0.25 #easiest to set this manually so that you know what it is

a + geom_histogram(aes(y = ..count.. / (sum(..count..) * binwidth),
                       color = sex), 
                   binwidth = binwidth,
                   fill="white",
                   position = "identity") +
    geom_density(alpha = 0.2) +
    scale_color_manual(values = c("#868686FF", "#EFC000FF"))
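For readers on a recent ggplot2 release, where the ..count.. notation is deprecated in favour of after_stat(), the same idea might be written as below. This is a sketch assuming ggplot2 3.3.0 or later and the wdata dummy data from the question; position = "stack" is used so that the stacked bars add up to the joint density.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

binwidth <- 0.25  # set manually so the same value is used in the formula

p <- ggplot(wdata, aes(x = weight)) +
  geom_histogram(
    # sum(count) runs over the bins of all groups, so each bar is scaled
    # by the total sample size instead of by its own group size
    aes(y = after_stat(count / (sum(count) * binwidth)), color = sex),
    binwidth = binwidth,
    fill = "white",
    position = "stack"
  ) +
  # No grouping aesthetic here, so one density curve for all observations
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

p
```

The rescaling works because the per-bin counts of all groups are summed before dividing, which makes the stacked bars integrate to 1 together, on the same scale as the single density curve.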


Upvotes: 1
