SwimBikeRun
SwimBikeRun

Reputation: 4460

Force R to plot histogram as probability (relative frequency)

I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have a bins that make a hump around 15.0

edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?

Upvotes: 20

Views: 66702

Answers (6)

Antoni Parellada
Antoni Parellada

Reputation: 4791

set.seed(0)

# Define a fair coin:
coin = c(1,0)

# We tossed the coin 10 times and counted the number of heads. Repeat the experiment 20000 times.
n = 20000   # Number of experiments
flips = 10  # Number of coin flips in each experiment.

heads = colSums(replicate(n, sample(coin, flips, replace = T))) # Counts of heads in each experiment.

# The breaks are the number of possible outcomes: flips + 1

h = hist(heads, breaks = sort(unique(heads)), freq=F, 
          border=F, main = 'Histogram counts of heads',
          col=rgb(0.3,0.8,0.8,0.6), ylab='Probability', 
          xlab =  'No. of heads in 10 flips fair coin')

enter image description here


If it helps anyone landing here, check this solution:

set.seed(0)

d = rnorm(1000)
n = 1000
d = rnorm(n)

histogram = hist(d, breaks=10, prob=T, border=F)
unique(diff(histogram$breaks)) # Because the size of the base of the rectangles is 0.5, the height will be double the tru relative freq.

# The fix. Notice that I redefine the histogram simply to show how simple the call is with with this fix.
h = hist(d, plot=F)
bp = barplot(h$counts/sum(h$counts), border=F)
axis(1, at=c(bp), labels=h$mids)
title(ylab="Relative Frequency")

Thanks to this answer.

Upvotes: 1

Nagababu Tubati
Nagababu Tubati

Reputation: 1

I observed that, in histogram density = relative frequency / corresponding bin width

Example 1:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)

h2 = hist(nums, plot=F)

rf2 = h2$counts / sum(h2$counts)

d2 = rf2 / diff(h2$breaks)

h2$density

[1] 0.06 0.00 0.02 0.01 0.01

d2

[1] 0.06 0.00 0.02 0.01 0.01

Example 2:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)

h3 = hist(nums, plot=F, breaks=c(1,30,40,50))

rf3 = h3$counts / sum(h3$counts)

d3 = rf3 / diff(h3$breaks)

h3$density

[1] 0.02758621 0.01000000 0.01000000

d3

[1] 0.02758621 0.01000000 0.01000000

Upvotes: 0

CoderGuy123
CoderGuy123

Reputation: 6649

R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot. enter image description hereenter image description here (I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)

Upvotes: -1

IRTFM
IRTFM

Reputation: 263342

To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)

Upvotes: 47

Hong Ooi
Hong Ooi

Reputation: 57686

The default number of breaks is around log2(N) where N is 6 million in your case, so should be 22. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram, it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.


The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?

Upvotes: 3

Gavin Simpson
Gavin Simpson

Reputation: 174813

Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1

Upvotes: 2

Related Questions