Timo

Reputation: 5390

Plot probability with ggplot2 (not density)

I'd like to plot data such that the y-axis shows probability (in the range [0,1]) and the x-axis shows the data values. The data is continuous (also in the range [0,1]), so I'd like to use a kernel density estimation function and normalize it so that the y-value at a point x gives the probability of seeing the value x in the input data.

So, I'd like to ask:

a) Is this reasonable at all? I understand that I cannot have a probability of seeing values I do not have in the data, but I would just like to interpolate between the points I have using a kernel density estimation function and normalize it afterwards.

b) Are there any built-in options in ggplot2 that I could use to do this, for example by overriding the default behavior of geom_density()?

Thanks in advance,

Timo

EDIT: When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks, guys, for clearing this up for me.

Upvotes: 7

Views: 25413

Answers (3)

daroczig

Reputation: 28612

Just making up a quick merge of @JD Long's and @yesterday's answers:

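# df is assumed to be a data frame with a numeric column x
ggplot(df, aes(x=x)) +
  # after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), binwidth=density(df$x)$bw) +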
  geom_density(fill="red", alpha = 0.2) +
  theme_bw() +
  xlab('') +
  ylab('')

This way the binwidth for the ggplot2 histogram is calculated by the density function, and the density curve is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.

Upvotes: 13

JD Long

Reputation: 60756

This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)

Start with some synthetic data:

set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")

The density function has a reasonably smart bandwidth calculator which you can borrow from:

bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)

Then use the calculated bandwidth as the standard deviation to generate some random noise:

noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")

Hey look! A kernel smoothed histogram!
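
The two steps above (resample with replacement, then add Gaussian noise whose SD is the bandwidth) amount to drawing values from the Gaussian kernel density estimate itself, since density() uses a Gaussian kernel by default. A minimal sketch that wraps them in one helper; smoothed_bootstrap is a made-up name for illustration, not a function from any package:

```r
# Hypothetical helper (not from any package): draw n values from the
# Gaussian kernel density estimate of x by resampling and adding noise.
smoothed_bootstrap <- function(x, n = 10000, bw = density(x)$bw) {
  sample(x, n, replace = TRUE) + rnorm(n, mean = 0, sd = bw)
}

set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3))
smoothed <- smoothed_bootstrap(randomData)

# Histogram of the smoothed resample with the original density on top
hist(smoothed, freq = FALSE, breaks = 50)
lines(density(randomData), col = "red")
```

The bw argument defaults to the bandwidth that density() picks, so passing a different value is a simple way to see how the amount of smoothing changes the histogram.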

I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.

Upvotes: 11

Andrie

Reputation: 179398

You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().

See the online user manual: http://had.co.nz/ggplot2/stat_density.html. You can specify any of the kernel estimation functions that are supported by stats::density().

library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
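
If the goal is a curve whose y-axis runs from 0 to 1, as the question's edit about "scaling" suggests, stat_density() also computes a scaled variable: the density estimate divided by its maximum. A minimal sketch, assuming ggplot2 >= 3.3.0 for after_stat() (older versions spell it ..scaled..); the rbeta() data is an arbitrary stand-in for values in [0,1]:

```r
library(ggplot2)

set.seed(42)
# Synthetic data in [0, 1]; rbeta() is an arbitrary choice for illustration
df <- data.frame(x = rbeta(1000, 2, 5))

# after_stat(scaled) is the density estimate divided by its maximum,
# so the curve peaks at exactly 1
p <- ggplot(df, aes(x = x)) +
  stat_density(aes(y = after_stat(scaled)), kernel = "biweight") +
  ylab("scaled density")
p
```

Note that the scaled values are still not probabilities: for a continuous variable the probability of any single exact value is zero, and the probability of an interval is the area under the unscaled density curve over that interval.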

Upvotes: 7
