Reputation: 6314
I'm trying to plot a stacked histogram using ggplot2:
set.seed(1)
my.df <- data.frame(param = runif(10000, 0, 1),
                    x = runif(10000, 0.5, 1))
my.df$param.range <- cut(my.df$param, breaks = 5)
require(ggplot2)
Without log-transforming the y-axis:
ggplot(my.df, aes_string(x = "x", fill = "param.range")) +
  geom_histogram(binwidth = 0.1, pad = TRUE) +
  scale_fill_grey()
But I want to log10+1 transform the y-axis to make it easier to read:
ggplot(my.df, aes_string(x = "x", y = "..count..+1", fill = "param.range")) +
  geom_histogram(binwidth = 0.1, pad = TRUE) +
  scale_fill_grey() +
  scale_y_log10()
The tick marks on the y-axis of the resulting plot don't make sense.
I get the same behavior if I log10 transform rather than log10+1:
ggplot(my.df, aes_string(x = "x", fill = "param.range")) +
  geom_histogram(binwidth = 0.1, pad = TRUE) +
  scale_fill_grey() +
  scale_y_log10()
Any idea what is going on?
Upvotes: 3
Views: 3175
Reputation: 93821
It looks like invoking scale_y_log10 with a stacked histogram is causing ggplot to plot the product of the counts for each component of the stack within each x bin. Below is a demonstration. We create a data frame called product.of.counts that contains the product, within each x bin, of the counts for each param.range bin. We use geom_text to add those values to the plot and see that they coincide with the top of each stack of histogram bars.
At first I thought this was a bug, but after a bit of searching, I was reminded of the way ggplot does the log transformation. As described in the linked answer, "scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense."
As a simpler example, say each of the five components of a stacked bar has a count of 100. Then log10(100) = 2 for all five, and the sum of the logs is 10. ggplot then takes the anti-log for the scale, which gives 10^10 for the total height of the bar (that is, 100^5), even though the actual height is 100 x 5 = 500. This is exactly what's happening with your plot.
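The arithmetic in that example can be checked directly in base R. This is only a sketch of the effect (summing logs and then taking the anti-log), not of ggplot's internals, and the variable names are made up for illustration.

```r
# Five stacked components, each with a count of 100 (the example above)
counts <- rep(100, 5)

# Stacking on a log10 scale: sum the logs, then read the result in anti-log form
log.stack.top <- 10^sum(log10(counts))

# What a stacked bar is meant to show: the sum of the counts
true.total <- sum(counts)

# The log-stacked height matches the product of the counts, not their sum
isTRUE(all.equal(log.stack.top, prod(counts)))
true.total
```

The same comparison, applied to the actual bin counts from the question's data, is what the demonstration that follows puts on the plot.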
library(dplyr)
library(ggplot2)

# Data
set.seed(1)
my.df <- data.frame(param = runif(10000, 0, 1), x = runif(10000, 0.5, 1))
my.df$param.range <- cut(my.df$param, breaks = 5)

# Calculate product of counts within each x bin
product.of.counts = my.df %>%
  # Bin x the same way the histogram does (width 0.1, centred on 0, 0.1, ..., 1)
  group_by(param.range,
           breaks = cut(x, breaks = seq(-0.05, 1.05, 0.1), labels = seq(0, 1, 0.1))) %>%
  tally() %>%
  group_by(breaks) %>%
  # param.range = NA so the inherited fill aesthetic can be resolved in geom_text
  summarise(prod = prod(n),
            param.range = NA) %>%
  ungroup() %>%
  mutate(breaks = as.numeric(as.character(breaks)))

# Stacked histogram on a log10 y scale, labelled with the product of counts
ggplot(my.df, aes(x, fill = param.range)) +
  geom_histogram(binwidth = 0.1, colour = "grey30") +
  scale_fill_grey() +
  scale_y_log10(breaks = 10^(0:14)) +
  geom_text(data = product.of.counts, size = 3.5,
            aes(x = breaks, y = prod, label = format(prod, scientific = TRUE, digits = 3)))
Upvotes: 4