Reputation: 31
I'd like to generate a histogram of some data I have, but the resultant plots with either base plot(hist()) or ggplot histogram functions show mostly zeros, with a count number corresponding to the number of rows. There are no zeros in the actual data, and the column is class integer. Changing the class to numeric doesn't appear to have any effect.
The data look like this:
> head(lengths)
            gene  size
1  0610005C13Rik  7381
3  0610009B22Rik  3249
4  0610009E02Rik 12071
7  0610009L18Rik  2512
8  0610010F05Rik 68682
11 0610010K14Rik  2710
> dim(lengths)
[1] 25230     2
> summary(lengths)
            gene            size
 0610005C13Rik:    1   Min.   :       20
 0610009B22Rik:    1   1st Qu.:     4082
 0610009E02Rik:    1   Median :    13768
 0610009L18Rik:    1   Mean   :   177473
 0610010F05Rik:    1   3rd Qu.:    37702
 0610010K14Rik:    1   Max.   :163098416
 (Other)      :25224
It's a very simple table consisting of the transcript lengths of every gene in the mouse genome, according to the refFlat table from UCSC. summary() clearly indicates there are no zeros in the size column. However, plot(hist(lengths$size)) or ggplot(lengths) + geom_histogram(aes(size)) show the vast majority of values as zeros, and it appears that the count corresponds to the number of entries in the data.
Below are the outputs from base and ggplot histogram functions, with the following code:
> plot(hist(lengths$size))
> plot(hist(subset(lengths, size>0)$size))
> ggplot(lengths, aes(size)) + geom_histogram() + ggtitle("Lengths")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> ggplot(subset(lengths, size>0), aes(size)) + geom_histogram() + ggtitle("Lengths, subset size>0")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Linked plots: base plot, subset size>0; ggplot; ggplot, subset size>0
Sorry about the links, I'm a first time poster and don't have enough reputation to add inline images.
As you can see, even when I plot a subset of the data that contains no zeros, the histogram still shows the total count of values as zeros! I don't understand how to resolve this behavior; it's completely wild to me. I'm sure I'm making a simple error, but I can't seem to figure it out. Any help would be greatly appreciated.
Again, thanks in advance for anyone's help with my conundrum.
EDIT
I'm an idiot, and it's merely out of scale. Thanks to @Axeman and @user26050. Here's the plot in log10 scale, using the following code:
> ggplot(lengths, aes(log10(size))) + geom_histogram() + ggtitle("Log10(size)")
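A closely related variant, sketched here on made-up data because the real table isn't posted: scale_x_log10() log-transforms the axis instead of the values, so the tick labels stay in the original units. The simulated lengths data frame below is only a hypothetical stand-in with a similar right skew, and bins = 30 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: right-skewed positive sizes
# plus one extreme value equal to the Max. reported by summary().
set.seed(1)
lengths <- data.frame(
  size = c(round(rlnorm(25229, meanlog = 9.5, sdlog = 1.5)) + 20, 163098416)
)

# Bin on a log10 axis; the data themselves are left untouched.
p_log <- ggplot(lengths, aes(size)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  ggtitle("size, log10 x-axis")

print(p_log)
```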
Upvotes: 3
Views: 1889
Reputation: 312
It would be great if you could provide the data frame. Then people could test their answers and post their code here. But the problem is pretty obvious from what you posted.
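A histogram shows the number of observations that fall into each bin (a range of values). You have 25230 observations, and we can see that more than 25000 of them are counted in the very first bin. So all the other bins together contain fewer than 230 observations, and their bars are too small to see on this scale. The tall bar is not a count of zeros; it is the count of every value between the lower and upper edge of the first bin.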
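To check this on your own data, you can ask hist() for the bin counts without drawing anything. The sketch below uses simulated values as a hypothetical stand-in for lengths$size (the real data frame wasn't posted), with one extreme value matching the reported maximum.

```r
# Hypothetical stand-in for lengths$size: right-skewed positive values
# plus one extreme outlier equal to the Max. shown by summary().
set.seed(1)
size <- c(round(rlnorm(25229, meanlog = 9.5, sdlog = 1.5)) + 20, 163098416)

# Compute the bins only; plot = FALSE suppresses the drawing.
h <- hist(size, plot = FALSE)

h$breaks[1:2]      # lower and upper edge of the first bin
h$counts[1]        # observations that land in the first bin
sum(h$counts[-1])  # observations in all the remaining bins combined
```

If the first count is close to the number of rows and the first bin's upper edge is in the millions, the plot is behaving as designed: the bins are simply very wide relative to most of the values.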
Suggestions for you:
1. Use more bins. ggplot was trying to help you: "stat_bin() using bins = 30. Pick better value with binwidth". You can add either binwidth= or bins= inside geom_histogram() to pick the value that gives the best visualisation. For example, try geom_histogram(bins=1000).
2. Use a density plot. Just use geom_density() instead of geom_histogram().
3. Maybe you just want some other plot?
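Suggestions 1 and 2 could look roughly like this. It is only a sketch on simulated data, since the original data frame isn't available: the lengths data frame below is a hypothetical stand-in, and bins = 1000 is just the example value from above, not a tuned choice.

```r
library(ggplot2)

# Hypothetical stand-in for the asker's data frame.
set.seed(1)
lengths <- data.frame(
  size = c(round(rlnorm(25229, meanlog = 9.5, sdlog = 1.5)) + 20, 163098416)
)

# Suggestion 1: many narrow bins instead of the default 30.
p_bins <- ggplot(lengths, aes(size)) +
  geom_histogram(bins = 1000) +
  ggtitle("Lengths, bins = 1000")

# Suggestion 2: a kernel density estimate instead of binned counts.
p_density <- ggplot(lengths, aes(size)) +
  geom_density() +
  ggtitle("Lengths, density")

print(p_bins)
print(p_density)
```

With a range this wide, both plots may still look squeezed against the left edge on a linear axis, so you will probably need to experiment with the bin count or the axis limits.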
Upvotes: 3