Reputation: 4201
I want to evaluate the probability density of data I've simulated.
dnorm()
in the following way:dist_mean <- 10
dist_sd <- 0.2
prob_density_on_x_val <- dnorm(x = 9.9,
mean = dist_mean,
sd = dist_sd)
prob_density_on_x_val
[1] 1.760327
dist_mean <- 10
dist_sd <- 0.2
## simulate 100,000 values from the normal distribution,
## given specific mean and standard deviation.
set.seed(123)
random_vals <- rnorm(n = 100000,
mean = dist_mean,
sd = dist_sd)
hist(random_vals)
library("pracma")
trapz(random_vals)
random_vals
[1] 1000009
In this comment, @Glen_b says using ecdf()
is the way to calculate probability in a range between two x values "a" and "b": ecdf(b)-ecdf(a)
. However, something doesn't make sense, because:
cdf <- ecdf(random_vals)
range_density <- cdf(10.2)-cdf(9.7)
range_density
[1] 0.77358
How is it possible that the probability density on a point value (x=9.9) was 1.76, but for a range 9.7<x<10.2
it's smaller (0.77)? Both distributions (both the one defined with dnorm
and the one simulated with rnorm
) have the same mean and sd.
So I think I'm missing something fundamental, and would be grateful for any help. Overall, it seems like a very simple question, but I can't find a straightforward solution despite a lot of reading and digging.
Thanks!
The thing I was missing was the distinction between:
dnorm()
is useful for)Upvotes: 2
Views: 2194
Reputation: 37641
You can get the probability density function using the functions density
and approxfun
.
DensityFunction = approxfun(density(random_vals), rule=2)
DensityFunction(9.7)
[1] 0.6410087
plot(DensityFunction, xlim=c(9,11))
You can get the area under the curve using integrate
AreaUnderCurve = function(lower, upper) {
integrate(DensityFunction, lower=lower, upper=upper) }
AreaUnderCurve(10,11)
0.5006116 with absolute error < 6.4e-05
AreaUnderCurve(9.5,10.5)
0.9882601 with absolute error < 0.00011
You also ask:
How is it possible that that the probability density on point value (x=9.9) was 1.76, but for a range 9.7
The value of the pdf (1.76) is the height of the curve. The value you are getting for the range is the area under the curve. Since the width of the interval is 0.5, it is not a surprise that the area under the curve is less than the height.
Upvotes: 2
Reputation: 8846
It doesn't make sense to calculate the probability for a single value in a continuous probability function, it is by definition zero, but you can calculate relative likelihoods. You react on random_vals
not summing to one, but not that prob_density_on_x_val
is more than one?
Glen is of course correct in that ecdf()
is the way to go for a non-parametric estimate, but if you expect a normal distribution you can also do a parametric estimate.
dist_mean <- 10
dist_sd <- 0.2
a <- 9.7
b <- 10.2
set.seed(123)
r <- rnorm(1e4, dist_mean, dist_sd)
# population
pnorm(b, dist_mean, dist_sd) - pnorm(a, dist_mean, dist_sd)
# [1] 0.7745375
# parametric estimate
pnorm(b, mean(r), sd(r)) - pnorm(a, mean(r), sd(r))
# [1] 0.7753985
# nonparametric estimate
ecdfun <- ecdf(r)
ecdfun(b) - ecdfun(a)
# [1] 0.7754
Upvotes: 3