Reputation: 4460

Force R to plot histogram as probability (relative frequency)

I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have a bins that make a hump around 15.0

edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?

Upvotes: 20

Answers (6)

Antoni Parellada

Reputation: 4791

set.seed(0)

# Define a fair coin:
coin = c(1,0)

# We tossed the coin 10 times and counted the number of heads. Repeat the experiment 20000 times.
n = 20000   # Number of experiments
flips = 10  # Number of coin flips in each experiment.

heads = colSums(replicate(n, sample(coin, flips, replace = T))) # Counts of heads in each experiment.

# The breaks are the number of possible outcomes: flips + 1

h = hist(heads, breaks = sort(unique(heads)), freq=F, 
          border=F, main = 'Histogram counts of heads',
          col=rgb(0.3,0.8,0.8,0.6), ylab='Probability', 
          xlab =  'No. of heads in 10 flips fair coin')

If it helps anyone landing here, check this solution:

set.seed(0)

d = rnorm(1000)
n = 1000
d = rnorm(n)

histogram = hist(d, breaks=10, prob=T, border=F)
unique(diff(histogram$breaks)) # Because the size of the base of the rectangles is 0.5, the height will be double the tru relative freq.

# The fix. Notice that I redefine the histogram simply to show how simple the call is with with this fix.
h = hist(d, plot=F)
bp = barplot(h$counts/sum(h$counts), border=F)
axis(1, at=c(bp), labels=h$mids)
title(ylab="Relative Frequency")

Thanks to this answer.

Upvotes: 1

Nagababu Tubati

Reputation: 1

I observed that, in histogram density = relative frequency / corresponding bin width

Example 1:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)

h2 = hist(nums, plot=F)

rf2 = h2$counts / sum(h2$counts)

d2 = rf2 / diff(h2$breaks)

h2$density

[1] 0.06 0.00 0.02 0.01 0.01

d2

[1] 0.06 0.00 0.02 0.01 0.01

Example 2:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)

h3 = hist(nums, plot=F, breaks=c(1,30,40,50))

rf3 = h3$counts / sum(h3$counts)

d3 = rf3 / diff(h3$breaks)

h3$density

[1] 0.02758621 0.01000000 0.01000000

d3

[1] 0.02758621 0.01000000 0.01000000

Upvotes: 0

CoderGuy123

Reputation: 6649

R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot. enter image description here (I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)

Upvotes: -1

IRTFM

Reputation: 263342

To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)

Upvotes: 47

Hong Ooi

Reputation: 57686

The default number of breaks is around log2(N) where N is 6 million in your case, so should be 22. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram, it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.

The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?

Upvotes: 3

Gavin Simpson

Reputation: 174813

Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1

Upvotes: 2

Force R to plot histogram as probability (relative frequency)

Answers (6)

Related Questions