Reputation: 3335
I am having trouble understanding how to implement a Gaussian kernel density estimation of the following dataset in R. I appreciate if you can help me understand the mechanism of how to do it. I am currently trying to get a formula for the bell shaped curves at the bottom of the following picture. As you can see there is one bell shaped curve for each data point. (Note the picture does not represent the data I am using.)
This is my data:
x<-c(4.09, 4.46, 4.61, 4.30, 4.03, 5.22, 4.21, 4.07, 4.02, 4.58, 4.66, 4.05, 4.23, 5.51, 4.03, 4.72, 4.47, 4.50, 5.80, 4.30, 4.09, 4.78, 4.18, 4.45, 4.40, 5.60, 4.37, 4.42, 4.88, 4.20, 4.45, 4.10, 4.43, 4.58, 4.40, 4.38)
(x has 36 elements)
This is the kernel density estimator:
(If you can't see the image, it's from this page http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/xlghtmlnode33.html)
is the Gaussian kernel function and h=.1516 is the bandwidth selected by Scott.
So, plugging in we get f hat (x) = 1/(36*.1516) (1/sqrt(2pi))[e^(-1/2 ((4.09-x)/.1516)^2 + e^(-1/2 ((4.46-x)/.1516)^2 + ... + e^(-1/2 ((4.38-x)/.1516)^2]
Ok. So we have a function of x. But how do we get the equation of each of the bell shaped curves in the above diagram? If we plug in, for example, 4.09, into f hat (x) we get a number, not a curve/function/distribution. Can someone help me understand the procedure to find the equation for the bell shaped curve/kernel density estimate?
Upvotes: 0
Views: 6386
Reputation: 11
## Given data
x <- c(4.09, 4.46, 4.61, 4.30, 4.03, 5.22, 4.21, 4.07, 4.02, 4.58, 4.66, 4.05,
4.23, 5.51, 4.03, 4.72, 4.47, 4.50, 5.80, 4.30, 4.09, 4.78, 4.18, 4.45,
4.40, 5.60, 4.37, 4.42, 4.88, 4.20, 4.45, 4.10, 4.43, 4.58, 4.40, 4.38)
h <- 0.1516
# GaussianKernel
GK <- function(u) {(1/sqrt(2*pi))*exp(-(u^2)/2)} # or dnorm(u)
This function gives a similar plot.
DensityGraph <- function(x, h){
n <- length(x)
xi <- seq(min(x) - sd(x), max(x) + sd(x), length.out = 512)
# fhat without sum since we are interest in the bell shaped curves
fhat <- sapply(x, function(y){(1/(n*h))*GK((xi - y)/h)})
# histogram of x
hist (x, freq = FALSE, nclass = 15, main = "Kernel density with histogram",
xlab = paste("N = ", n, " ", "Bandwidth = ", h))
# add fhat with sum
lines(xi, rowSums(fhat), lwd = 2)
# add the bell shaped curves
apply(fhat, 2, function(j) lines(xi, j, col = 4))
# show data points
rug (x, lwd = 2, col = 2)
}
DensityGraph(x = x, h = 0.05)
Blue bell shaped curves represent each data point of x
DensityGraph(x = x, h = 0.1516)
Compare with built in density function in R
lines(density(x = x, bw = 0.1516), col = 3, lwd = 2)
This function gives the value of fhat given a specific x
fhat <- function(x, h, specific_x){
n <- length(x)
xi <- seq(min(x) - sd(x), max(x) + sd(x), length.out = 512)
f <- rowSums(sapply(x, function(y){(1/(n*h))*GK((xi - y)/h)}))
kde <- data.frame(xi, fhat = f)
indx <- which.min(abs(xi - specific_x))
fx <- kde[indx, "fhat"]
list(fx = fx, kde = kde)
}
KernelDensity <- fhat(x = x, h = 0.1516, specific_x = 4.09)
KernelDensity$fx
# [1] 0.9114677
plot(KernelDensity$kde, type = "l", lwd = 2, xlab = "")
title(xlab = paste("N = ", n, " Bandwidth = ", h))
rug(x, lwd = 2, col = 2)
Compare built in density function
lines(density(x, bw = 0.1516), col = 5)
Upvotes: 1
Reputation: 206526
Here's a function that will return your fhat function given your x
values and h
value
get_fhat <- function(x, h) {
Vectorize(function(z) 1/length(x)/h*sum(dnorm((x-z)/h)))
}
This function returns a function that we can use to get values. We Vectorize
it so we can pass in multiple values at once to the function.
We can get a single value or plot it with
fhat <- get_fhat(x, .1516)
fhat(4.09)
# [1] 0.9121099
curve(fhat, from=min(x), to=max(x))
Upvotes: 2