Reputation: 5390
I'd like to plot data such that the y-axis shows probability (in the range [0,1]) and the x-axis shows the data values. The data is continuous (also in the range [0,1]), so I'd like to use a kernel density estimation function and normalize it so that the y-value at a point x means the probability of seeing the value x in the input data.
So, I'd like to ask:
a) Is it reasonable at all? I understand that I cannot have probability of seeing values I do not have in the data, but I just would like to interpolate between points I have using a kernel density estimation function and normalize it afterwards.
b) Are there any built-in options in ggplot I could use, that would override default behavior of geom_density() for example for doing this?
Thanks in advance,
Timo
EDIT: When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks, guys, for clearing this up for me.
Upvotes: 7
Views: 25413
Reputation: 28612
Just making up a quick merge of @JD Long's and @yesterday's answers:
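library(ggplot2)
# Example data (an assumption; any data frame with a numeric column x works)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x = x)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), binwidth = density(df$x)$bw) +
  geom_density(fill = "red", alpha = 0.2) +
  theme_bw() +
  xlab('') +
  ylab('')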
This way the binwidth for ggplot2 was calculated by the density function, and the density curve is drawn on top of the histogram with a nice transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.
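For reference, with its default arguments density() picks the bandwidth using bw.nrd0() (a rule-of-thumb selector), so the binwidth used above can also be computed directly. A minimal sketch in base R; the vector x is made-up example data, not anything from the question:

```r
# Made-up example data (an assumption, not from the question).
set.seed(42)
x <- rnorm(500)

# Bandwidth chosen by density() with its default bw = "nrd0".
bw_from_density <- density(x)$bw

# The same rule of thumb, called directly.
bw_direct <- bw.nrd0(x)

# Check that the two agree, so either could be passed as `binwidth`.
same_bw <- isTRUE(all.equal(bw_from_density, bw_direct))
print(same_bw)
```

Other selectors (for example bw.SJ()) can be swapped in the same way if the default looks too smooth or too rough for your data.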
Upvotes: 13
Reputation: 60756
This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)
Start with some synthetic data:
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")
The density function has a reasonably smart bandwidth calculator which you can borrow from:
bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)
Then use the bandwidth calc as the SD to make some random noise:
noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")
Hey look! A kernel smoothed histogram!
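Resampling with replacement and then adding Gaussian noise with SD equal to the bandwidth amounts to drawing from the Gaussian kernel density estimate itself (sometimes called a smoothed bootstrap). A rough sketch wrapping the two steps in a helper; the name smoothed_resample is made up for illustration:

```r
# Hypothetical helper: resample the data with replacement, then jitter
# each draw with Gaussian noise whose SD is the density() bandwidth.
smoothed_resample <- function(x, n = 10000) {
  bw <- density(x)$bw                    # default Gaussian-kernel bandwidth
  draws <- sample(x, n, replace = TRUE)  # plain bootstrap resample
  draws + rnorm(n, mean = 0, sd = bw)    # add kernel-shaped noise
}

set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3))
smoothed <- smoothed_resample(randomData, n = 10000)

# Histogram of the smoothed draws with the original density estimate on top.
hist(smoothed, freq = FALSE)
lines(density(randomData), col = "red")
```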
I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
Upvotes: 11
Reputation: 179398
You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().
See the online user manual (http://had.co.nz/ggplot2/stat_density.html). You can specify any of the kernel estimation functions that are supported by stats::density():
library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
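If what you want is the "scale" from the question's edit, stat_density() also computes a variable called scaled, which is the density estimate divided by its maximum. A minimal sketch, assuming ggplot2 3.3.0 or later for after_stat(); the beta-distributed data is made up to stand in for values in [0,1]. Note that the result is a rescaled density, not a probability: for continuous data the probability of any exact value is zero, and an unscaled density can be greater than 1.

```r
library(ggplot2)

# Made-up data in [0, 1], standing in for the question's data (an assumption).
set.seed(1)
df <- data.frame(x = rbeta(1000, 2, 5))

# after_stat(scaled) is the density estimate divided by its maximum,
# so the curve peaks at 1. It is a rescaled density, not a probability.
p <- ggplot(df, aes(x = x)) +
  stat_density(aes(y = after_stat(scaled)), kernel = "biweight")

print(p)
```

The same aes(y = after_stat(scaled)) mapping should work with geom_density() as well, since it uses stat_density() underneath.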
Upvotes: 7