eaglefreeman

Reputation: 854

Why are R hist and ggplot histograms output so different?

I got some strange results trying to plot the histograms of a pretty standard random variable with ggplot.

library(ggplot2)
RB <- rbinom(10000000, 100, .3)
qplot(RB)  # histogram of the distribution with ggplot2; assumes 30 bins by default
dev.new()
hist(RB, breaks = 30)  # same with the base histogram
dev.new()
qplot(RB, binwidth = 1)  # if we force binwidth to 1 it seems to work
dev.new()
hist(RB, breaks = range(RB)[2] - range(RB)[1])

The result of the first call to qplot is quite odd. With this number of draws, I expected the graph to show a smooth distribution.

I use ggplot2 version 1.0.0 and R 3.0.2.

Upvotes: 1

Views: 3331

Answers (1)

tonytonov

Reputation: 25608

By default, ggplot uses range/30 as the binwidth, as the message it prints says. In your case, that is approximately 48/30 (it depends on the seed), i.e. about 1.6, which is more than 1.
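A minimal sketch of that calculation, using base R only. The smaller sample size, the seed, and the names `RB_small`, `edges`, and `per_bin` are assumptions made for illustration, so the exact range will differ from the question's.

```r
# Sketch only: a smaller sample than the question's, with an arbitrary seed.
set.seed(1)
RB_small <- rbinom(100000, 100, 0.3)

# range/30, the default binwidth described above
binwidth <- diff(range(RB_small)) / 30
print(binwidth)

# 31 equally spaced edges give 30 bins of that width
edges <- seq(min(RB_small), max(RB_small), length.out = 31)

# How many distinct integer values does each bin cover?
ints <- min(RB_small):max(RB_small)
per_bin <- table(cut(ints, breaks = edges, include.lowest = TRUE))
print(as.vector(per_bin))
```

Because the binwidth lies between 1 and 2, each bin covers either one or two integer values, which is what produces the uneven bar heights.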

Now, your data are not continuous: you only get integers. With a non-integer binwidth, adjacent histogram bins cover different numbers of possible values, so one bin will contain only one possible integer, the next will contain two, and so on. As a result, you'll see the count approximately doubled in roughly every second bin.

Say, your data looks like

1 2 3 4 5 6
5 5 5 5 5 5

and if you start counting from 0.5, you'll get these bins:

(0.5, 2] (2, 3.5] (3.5, 5] (5, 6.5]
      10        5       10        5

which are exactly the spikes you see in the first of your plots.
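The toy example can be checked with a few lines of base R. This is only a sketch; the variable names `x`, `breaks`, and `counts` are made up for illustration.

```r
# Six integer values, each observed five times
x <- rep(1:6, each = 5)

# Bins of width 1.5 starting at 0.5: (0.5,2], (2,3.5], (3.5,5], (5,6.5]
breaks <- seq(0.5, 6.5, by = 1.5)

# cut() uses right-closed intervals by default, matching the bins above
counts <- table(cut(x, breaks = breaks))
print(counts)
```

The bins that cover two integers collect twice as many observations as the bins that cover one.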

As you have already found out, this won't be a problem if binwidth is strictly 1.
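For base graphics, one way to get the same one-integer-per-bin behaviour is to pass explicit break points centred on the integers. Note that a single number passed to `breaks` in `hist` is documented as a suggestion only, so it does not guarantee a width of 1. The sketch below assumes a smaller sample and an arbitrary seed, and `RB_small`, `h`, and `exact` are illustrative names.

```r
# Sketch only: a smaller sample than the question's, with an arbitrary seed.
set.seed(1)
RB_small <- rbinom(100000, 100, 0.3)

# Break points at half-integers, so every bin holds exactly one integer value
breaks <- seq(min(RB_small) - 0.5, max(RB_small) + 0.5, by = 1)
h <- hist(RB_small, breaks = breaks, plot = FALSE)

# Exact frequency of every integer in the observed range (zeros included)
exact <- as.vector(table(factor(RB_small,
                                levels = min(RB_small):max(RB_small))))

# The histogram counts should coincide with the exact frequencies
print(all(h$counts == exact))
```

Dropping `plot = FALSE` draws the histogram itself.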

Edit:

As pointed out by @James, use the following to reproduce the picture given by ggplot with base graphics:

hist(RB, breaks=seq(min(RB), max(RB), length.out=30))

It may look a bit different, but the spikes are there.

Upvotes: 6
