Reputation: 25
I wanted to get a smooth estimate of a cumulative distribution function. One way to do this is to integrate a kernel density estimator, which gives a kernel distribution estimator. To get one, I used the kde function from the "kerdiest" package.
The problem is that I have to specify a grid, which affects the results greatly. The default choice of grid leads to a graph that differs significantly from the plot of the empirical distribution function (see the picture; white dots represent the empirical c.d.f.). I can pick grid values so that the kernel estimator and the ecdf coincide, but I do not understand how this works.
So, what is the grid and how should it be chosen? Is there any other way to get a kernel estimator of a distribution function?
The data I have been experimenting with are the waiting times from the Old Faithful Geyser dataset in R.
The code is
x <- faithful$waiting
library("kerdiest")
n = length(x)
kcdf <- kde(type_kernel = "n", x, bw = 1/sqrt(n))
plot(kcdf$Estimated_values)
lines(ecdf(x))
Upvotes: 0
Views: 256
Reputation: 263362
Instead of plotting with the default plot function, you should be using both the Estimated_values and the grid values to form the initial plot. Then the lines function will have the correct x-values. (The clue here is the labeling of your plot. When seeing the "Index" label, you might have wondered whether it was the correct scale. When plot gets a single vector of numeric values, it uses their ordering sequence as the "Index" value, so you see integers: 1:length(vector).)
with( kcdf, plot(Estimated_values ~ grid) ) # using plot.formula
lines(ecdf(x))
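As for another way to get a kernel estimator of a distribution function: with a Gaussian kernel the integrated density estimate has a closed form, the average of pnorm((y - x_i) / h) over the data, so it can be sketched in base R without any package. The names kernel_cdf, grid and est below are my own, and the grid limits (3 bandwidths beyond the data range, 200 points) are assumptions chosen for illustration. Here the grid is simply the vector of points at which the estimate is evaluated; I believe the grid returned by kde plays the same role, but check the kerdiest documentation to be sure.

```r
# Sketch: Gaussian kernel distribution estimator in base R (no kerdiest needed).
x <- faithful$waiting
n <- length(x)
h <- 1 / sqrt(n)  # same bandwidth as in the question; not a recommendation

# F_hat(y) = mean(pnorm((y - x_i) / h)), i.e. the integrated Gaussian KDE.
# kernel_cdf is a hypothetical helper name, not a function from any package.
kernel_cdf <- function(y, data, bw) {
  vapply(y, function(yi) mean(pnorm((yi - data) / bw)), numeric(1))
}

# Evaluation grid: assumed here to span the data range plus 3 bandwidths.
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)
est  <- kernel_cdf(grid, x, h)

# Plot the estimate against the grid so both curves share the same x-scale.
plot(grid, est, type = "l", xlab = "waiting", ylab = "estimated c.d.f.")
lines(ecdf(x), col = "red")
```

Because the estimate is plotted against grid rather than against its index, the ecdf added by lines lands on the same x-axis. A larger bandwidth would give a visibly smoother curve than the ecdf.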
Upvotes: 1