dylan

Histogram shows mostly zeros, but no zeros in data

I'd like to generate a histogram of some data I have, but the resultant plots with either base plot(hist()) or ggplot histogram functions show mostly zeros, with a count number corresponding to the number of rows. There are no zeros in the actual data, and the column is class integer. Changing the class to numeric doesn't appear to have any effect.

The data look like this:

> head(lengths)
            gene  size
1  0610005C13Rik  7381
3  0610009B22Rik  3249
4  0610009E02Rik 12071
7  0610009L18Rik  2512
8  0610010F05Rik 68682
11 0610010K14Rik  2710
> dim(lengths)
[1] 25230     2
> summary(lengths)
            gene            size          
 0610005C13Rik:    1   Min.   :       20  
 0610009B22Rik:    1   1st Qu.:     4082  
 0610009E02Rik:    1   Median :    13768  
 0610009L18Rik:    1   Mean   :   177473  
 0610010F05Rik:    1   3rd Qu.:    37702  
 0610010K14Rik:    1   Max.   :163098416  
 (Other)      :25224

It's a very simple table consisting of the transcript lengths of every gene in the mouse genome, according to the refFlat table from UCSC. summary() clearly indicates there are no zeros in the size column. However, plot(hist(lengths$size)) or ggplot(lengths) + geom_histogram(aes(size)) show the vast majority of values as zeros, and it appears that the count corresponds to the number of entries in the data.

Below are the outputs from base and ggplot histogram functions, with the following code:

> plot(hist(lengths$size))
> plot(hist(subset(lengths, size>0)$size))
> ggplot(lengths, aes(size)) + geom_histogram() + ggtitle("Lengths")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> ggplot(subset(lengths, size>0), aes(size)) + geom_histogram() + ggtitle("Lengths, subset size>0")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figures (linked images): base plot; base plot, subset size>0; ggplot; ggplot, subset size>0]

Sorry about the links, I'm a first time poster and don't have enough reputation to add inline images.

As you can see, even when I plot a subset of the data that contains no zeros, the histogram still shows the total count of values as zeros! I don't understand how to resolve this behavior; it's completely wild to me. I'm sure I'm making a simple error, but I can't seem to figure it out. Any help would be greatly appreciated.

Again, thanks in advance for anyone's help with my conundrum.

EDIT

I'm an idiot, and it's merely out of scale. Thanks to @Axeman and @user26050. Here's the plot in log10 scale, using the following code:

> ggplot(lengths, aes(log10(size))) + geom_histogram() + ggtitle("Log10(size)")

[Figure: log10 ggplot distribution]
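As a hedged variation on the log10 plot, the transformation can be applied to the axis rather than to the variable, so the tick labels stay in the original units. This is only a sketch: `demo` is a made-up stand-in for the real `lengths` data frame, and the seed and range are arbitrary assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for `lengths` (NOT the real data):
# strictly positive sizes spread over several orders of magnitude
set.seed(42)
demo <- data.frame(size = ceiling(10^runif(500, min = 1.5, max = 8)))

# Same idea as aes(log10(size)), but the x axis is log-scaled
# and labelled in the original units
p <- ggplot(demo, aes(size)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  ggtitle("size on a log10 axis")

print(p)
```

With the real data, replacing `demo` with `lengths` should give a plot comparable to the one above, assuming `size` contains only positive values.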


Answers (1)

aschmsu

It would be great if you could provide the data frame. Then people could test their answers and post their code here. But the problem is pretty obvious from what you posted.

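A histogram shows the number of observations that fall within each range of values (bin). You have 25230 observations, and we see that more than 25000 of them are counted in the very first bin. That follows from your summary(): the maximum is 163098416, so with 30 bins each bin is several million wide, while the 3rd quartile is only 37702. The remaining bins together contain fewer than 230 observations, so their bars are too small to see at this scale.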
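The arithmetic can be checked directly from the summary() output in the question. This is a rough sketch: it assumes the range is simply split into 30 equal-width bins, which only approximates how hist() or stat_bin() actually place their breaks.

```r
# Values copied from the summary() output in the question
size_min <- 20
size_max <- 163098416
size_q3  <- 37702

# Approximate bin width if the range is split into 30 equal bins
# (an approximation; the plotting functions choose their own breaks)
bin_width <- (size_max - size_min) / 30
bin_width

# The 3rd quartile lies far below the upper edge of the first bin,
# so at least 75% of the values fall into that first bin
size_q3 < size_min + bin_width
```

The bin width comes out at roughly 5.4 million, more than a hundred times the 3rd quartile, which is why nearly everything piles up in the first bar.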

Suggestions for you:
1. Use more bins. ggplot was trying to help you with its message: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. You can add either binwidth= or bins= inside geom_histogram() to pick the value that gives the best visualisation. For example, try geom_histogram(bins=1000).
2. Use a density plot. Just use geom_density() instead of geom_histogram().
3. Maybe you just want some other kind of plot?
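The first two suggestions can be sketched as follows. Since the original data frame was not posted, `demo` below is a hypothetical, right-skewed stand-in generated for illustration only; the seed and distribution parameters are assumptions.

```r
library(ggplot2)

# Hypothetical right-skewed data (NOT the asker's data)
set.seed(1)
demo <- data.frame(size = round(10^rnorm(1000, mean = 4, sd = 1)) + 1)

# Suggestion 1: more bins, set explicitly with bins= (or binwidth=)
p_bins <- ggplot(demo, aes(size)) +
  geom_histogram(bins = 1000) +
  ggtitle("bins = 1000")

# Suggestion 2: a density plot instead of a histogram
p_density <- ggplot(demo, aes(size)) +
  geom_density() +
  ggtitle("density")

print(p_bins)
print(p_density)
```

With data as skewed as in the question, even 1000 bins may leave most values in the first few bars, so it is worth trying several values of bins= or binwidth=.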
