Emman
Emman

Reputation: 4201

Estimating probability density in a range between two x values on simulated data

I want to evaluate the probability density of data I've simulated.

  1. I know that if I simply want to find the probability density for a single x value on the normal distribution, I could use dnorm() in the following way:
dist_mean <- 10
dist_sd <- 0.2
prob_density_on_x_val <- dnorm(x = 9.9,
                               mean = dist_mean,
                               sd = dist_sd)

prob_density_on_x_val

[1] 1.760327
  1. But what if I want to evaluate the probability density for a range between two x values in a simulated data?
dist_mean <- 10
dist_sd <- 0.2

## simulate 100,000 values from the normal distribution, 
## given specific mean and standard deviation.
set.seed(123)
random_vals <- rnorm(n = 100000,
                     mean = dist_mean,
                     sd = dist_sd)


hist(random_vals)

histogram

  1. My 100,000 generated values are raw values, and they do take a normal shape. However, this is not a probability density function, as the area under the curve isn't equal to 1.
library("pracma")
trapz(random_vals)

random_vals
[1] 1000009

My questions:

  1. Given my simulated data, how can I create a probability density function for it?
  2. Once created, how could I estimate the: (1) probability under the curve, and (2) probability density on the curve, for a range between two x values? For example, the probability and probability density between x=9.7 and 10.2. Or any other range.

My attempt to figure this out:

In this comment, @Glen_b says using ecdf() is the way to calculate probability in a range between two x values "a" and "b": ecdf(b)-ecdf(a). However, something doesn't make sense, because:

cdf <- ecdf(random_vals)
range_density <- cdf(10.2)-cdf(9.7)

range_density
[1] 0.77358

How is it possible that the probability density on a point value (x=9.9) was 1.76, but for a range 9.7<x<10.2 it's smaller (0.77)? Both distributions (both the one defined with dnorm and the one simulated with rnorm) have the same mean and sd.

So I think I'm missing something fundamental, and would be grateful for any help. Overall, it seems like a very simple question, but I can't find a straightforward solution despite a lot of reading and digging.

Thanks!

Edit

The thing I was missing was the distinction between:

Upvotes: 2

Views: 2194

Answers (2)

G5W
G5W

Reputation: 37641

You can get the probability density function using the functions density and approxfun.

DensityFunction = approxfun(density(random_vals), rule=2)
DensityFunction(9.7)
[1] 0.6410087
plot(DensityFunction, xlim=c(9,11))

Probability density function

You can get the area under the curve using integrate

AreaUnderCurve = function(lower, upper) {
    integrate(DensityFunction, lower=lower, upper=upper) }

AreaUnderCurve(10,11)
0.5006116 with absolute error < 6.4e-05
AreaUnderCurve(9.5,10.5)
0.9882601 with absolute error < 0.00011

You also ask:

How is it possible that that the probability density on point value (x=9.9) was 1.76, but for a range 9.7

The value of the pdf (1.76) is the height of the curve. The value you are getting for the range is the area under the curve. Since the width of the interval is 0.5, it is not a surprise that the area under the curve is less than the height.

Upvotes: 2

AkselA
AkselA

Reputation: 8846

It doesn't make sense to calculate the probability for a single value in a continuous probability function, it is by definition zero, but you can calculate relative likelihoods. You react on random_vals not summing to one, but not that prob_density_on_x_val is more than one?

Glen is of course correct in that ecdf() is the way to go for a non-parametric estimate, but if you expect a normal distribution you can also do a parametric estimate.

dist_mean <- 10
dist_sd <- 0.2
a <- 9.7
b <- 10.2

set.seed(123)
r <- rnorm(1e4, dist_mean, dist_sd)

# population
pnorm(b, dist_mean, dist_sd) - pnorm(a, dist_mean, dist_sd)
# [1] 0.7745375

# parametric estimate
pnorm(b, mean(r), sd(r)) - pnorm(a, mean(r), sd(r))
# [1] 0.7753985

# nonparametric estimate
ecdfun <- ecdf(r)
ecdfun(b) - ecdfun(a)
# [1] 0.7754

Upvotes: 3

Related Questions