Reputation: 1748
I recently discovered some odd behavior in ggplot2 by accident. The following code
N <- 1000
coin <- rep(c(0,1),N/2)
N1 <- sum(coin)
N0 <- sum(1-coin)
values <- rep(0,N)
values[coin==0] <- rnorm(N0,mean=0,sd=1)
values[coin==1] <- rnorm(N1,mean=0,sd=1)
dat = data.frame('Value'=values,'Category'=as.factor(coin))
creates a dataset that has one numeric column and one factor column, with equal numbers of events belonging to each of the two categories:
> summary(dat)
     Value            Category
 Min.   :-3.901785   0:500
 1st Qu.:-0.669807   1:500
 Median : 0.020031
 Mean   :-0.008229
 3rd Qu.: 0.650803
 Max.   : 3.195819
However, when plotting the Value column broken down by category, category 1 appears with a much greater normalization than category 0:
ggplot(dat,aes(x=Value,fill=Category)) + geom_histogram(alpha=0.5) + theme_bw()
This looks very odd. The bin widths appear equal for the two histograms, as they should be, but the total counts do not appear equal, even though each category contains exactly 500 events. The category 0 histogram is in fact the histogram of the entire dataset:
ggplot(dat,aes(x=Value)) + geom_histogram(alpha=0.5) + theme_bw()
Is this a ggplot2 bug, or am I making some mistake I haven't noticed? (I get the same thing if I replace categories 0 and 1 with 'A' and 'B' by the way).
Upvotes: 3
Views: 52
Reputation: 66864
geom_histogram defaults to stacking the bars atop one another via the argument position="stack". This is why the outline of your plot matches the histogram of the entire dataset: the two categories are drawn on top of each other, not overlaid. Stacking is useful for seeing the overall composition and the contribution of each part at the same time, but less useful for comparing the parts directly. You can override this by changing the position argument to "identity", e.g.:
ggplot(dat,aes(x=Value,fill=Category)) +
geom_histogram(alpha=0.5, position="identity") + theme_bw()
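If you want to check this numerically rather than by eye, here is a rough sketch that inspects the bar coordinates ggplot2 computes for each layer. It assumes a ggplot2 version that provides layer_data(). The seed, bins = 30, and the names stacked, overlaid, stack_top and bin_total are my own choices for illustration, and the data is a simplified stand-in for the dat in the question.

```r
library(ggplot2)

# Simplified stand-in for the question's data: 500 values per category
set.seed(1)
dat <- data.frame(
  Value    = rnorm(1000),
  Category = factor(rep(c(0, 1), 500))
)

p <- ggplot(dat, aes(x = Value, fill = Category))

# Bar coordinates as computed by ggplot2 for each position setting
stacked  <- layer_data(p + geom_histogram(bins = 30))  # default position = "stack"
overlaid <- layer_data(p + geom_histogram(bins = 30, position = "identity"))

# Stacked: some bars start where the bar beneath them ends, so ymin > 0
print(any(stacked$ymin > 0))

# Identity: every bar starts at zero, so the two histograms overlap
print(all(overlaid$ymin == 0))

# Stacked: the top of each stack is the total count in that bin,
# i.e. the histogram of the entire dataset
stack_top <- tapply(stacked$ymax,  stacked$xmin, max)
bin_total <- tapply(stacked$count, stacked$xmin, sum)
print(isTRUE(all.equal(as.numeric(stack_top), as.numeric(bin_total))))
```

The per-category counts themselves are the same in both cases; only ymin and ymax differ, which is why the stacked plot makes one category look larger than the other.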
Upvotes: 5