Reputation: 663
I have the following distribution:
x<-c(22.5,28.14285714,33.78571429,39.42857143,45.07142857,50.71428571,56.35714286,62,67.64285714,73.28571429,78.92857143,84.57142857,90.21428571,95.85714286,101.5,107.1428571,112.7857143,118.4285714,124.0714286,129.7142857,135.3571429,141,146.6428571,152.2857143,157.9285714,163.5714286,169.2142857,174.8571429,180.5,186.1428571,191.7857143,197.4285714,203.0714286,208.7142857,214.3571429,220,225.6428571,231.2857143,236.9285714,242.5714286,248.2142857,253.8571429,259.5,265.1428571,270.7857143,276.4285714,282.0714286,287.7142857,293.3571429,299)
y<-c(0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.00328839614285714,0.00296425985714286,0.002655899,0.00236187857142857,0.002080895,0.00181184271428571,0.00155376085714286,0.00130578928571429,0.001074706,0.000877193,0.000709397142857142,0.000567189714285714,0.000447254,0.000346858571428571,0.000263689142857143,0.000195768428571429,0.000141427,9.92657142857141e-05,6.77857142857142e-05,4.48571428571428e-05,2.86428571428571e-05,1.75142857142857e-05,1.01357142857143e-05,5.52e-06,2.78857142857142e-06,1.27285714285713e-06,5.00714285714284e-07,1.5742857142857e-07,3.29857142857142e-08,2.78857142857137e-09,1.74e-12)
plot(x,y)
I would like to find the value of x
that separates an area of 0.95 under the distribution to the left and 0.05 of the area to the right (one-tailed 95% interval of credibility).
I guess I have to fit my empirical curve to a function and then integrate the function so I can obtain the desired value, but I don't really know from where to start.
How can this possibly be done in R?
Upvotes: 2
Views: 399
Reputation: 58845
As other answers have indicated, this is an integration under the curve problem, paired with determining where the area reaches 95% of the total area. I take a simpler approach to the integration than David's answer does. Rather than interpolate a curve and integrate that, I just use the trapezoid integration rules to get the area contributed by each interval. Those individual areas are then added to get the total area. Then the index where the cumulative area surpasses 95% of the total area is found and a line can be drawn with that.
piece_area <- c(0, (x[-1] - x[-length(x)])*(y[-1] + y[-length(y)]) / 2)
cum_area <- cumsum(piece_area)
total_area <- cum_area[length(cum_area)]
idx095 <- min(which(cum_area > 0.95 * total_area))
abline(v = x[idx095])
Higher resolution of the exact point where 95% is crossed can be obtained by using more points in the original sample of the distribution.
Upvotes: 4
Reputation: 1920
It is an integration problem ( sum under the curve ). You can divide your integration into a square one + the curve one. However, you can use a quick and dirty approximation via splines:
y<-c(0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.00328839614285714,0.00296425985714286,0.002655899,0.00236187857142857,0.002080895,0.00181184271428571,0.00155376085714286,0.00130578928571429,0.001074706,0.000877193,0.000709397142857142,0.000567189714285714,0.000447254,0.000346858571428571,0.000263689142857143,0.000195768428571429,0.000141427,9.92657142857141e-05,6.77857142857142e-05,4.48571428571428e-05,2.86428571428571e-05,1.75142857142857e-05,1.01357142857143e-05,5.52e-06,2.78857142857142e-06,1.27285714285713e-06,5.00714285714284e-07,1.5742857142857e-07,3.29857142857142e-08,2.78857142857137e-09,1.74e-12)
x<-c(22.5,28.14285714,33.78571429,39.42857143,45.07142857,50.71428571,56.35714286,62,67.64285714,73.28571429,78.92857143,84.57142857,90.21428571,95.85714286,101.5,107.1428571,112.7857143,118.4285714,124.0714286,129.7142857,135.3571429,141,146.6428571,152.2857143,157.9285714,163.5714286,169.2142857,174.8571429,180.5,186.1428571,191.7857143,197.4285714,203.0714286,208.7142857,214.3571429,220,225.6428571,231.2857143,236.9285714,242.5714286,248.2142857,253.8571429,259.5,265.1428571,270.7857143,276.4285714,282.0714286,287.7142857,293.3571429,299)
sp=smooth.spline(x,y)
f = function(t)
{
predict(sp,t)$y
}
N=500 # this is an accuracy parameter
xBis=seq(x[1],x[length(x)],length=N)
yBis=sapply(x,f)
J = function (input)
{ # This function takes input in 1:N
Integral = 0
dx=(x[length(x)]-x[1])/N
for ( j in 1: input)
{ z=xBis[j]
Integral=Integral+ f(x[1]+z)*dx
}
J=Integral
}
######
I=J(N) # This is the value of the sum under the curve
# It should be roughly equal (given the shape of the curve) to:
index=max(which(y==y[1]))
I = (x[index]-x[1])*(y[index])*3/2
######
res=sapply(1:N,J)/I
Index5=max(which(res<=.05))
Index95=min(which(res>=.95))
x5=xBis[Index5] # This is the 5% quantile
x95=xBis[Index95]
HTH
Let me know if anything is unclear
P.S I think there are far better ways to do it ..
Upvotes: 2