user1870614

Reputation: 414

Local linear regression in R -- locfit() vs locpoly()

I am trying to understand the different behaviors of these two smoothing functions when given apparently equivalent inputs. My understanding was that locpoly just takes a fixed bandwidth argument, while locfit can also include a varying part in its smoothing parameter (a nearest-neighbors fraction, "nn"). I thought setting this varying part to zero in locfit should make the "h" component act like the fixed bandwidth used in locpoly, but this is evidently not the case.

A working example:

library(KernSmooth)
library(locfit)
set.seed(314)

n <- 100
x <- runif(n, 0, 1)
eps <- rnorm(n, 0, 1)
y <- sin(2 * pi * x) + eps

plot(x, y)
lines(locpoly(x, y, bandwidth=0.05, degree=1), col=3)
lines(locfit(y ~ lp(x, nn=0, h=0.05, deg=1)), col=4)

Produces this plot:

[Image: plot of smoothers]

locpoly gives the smooth green line, and locfit gives the wiggly blue line. Clearly, locfit has a smaller "effective" bandwidth here, even though the supposed bandwidth parameter has the same value for each.

What are these functions doing differently?

Upvotes: 28

Views: 13240

Answers (2)

wmay

Reputation: 234

I changed your code a bit so we can see more clearly what the actual window widths are:

library(KernSmooth)
library(locfit)
x <- seq(.1, .9, length.out = 80)
y <- rep(0:1, each = 40)
plot(x, y)
lines(locpoly(x, y, bandwidth=0.1, degree=1), col=3)
lines(locfit(y ~ lp(x, nn=0, h=0.1, deg=1)), col=4)

The argument h from locfit appears to be a half-window width. locpoly's bandwidth is clearly doing something else.

KernSmooth's documentation is very ambiguous, but judging from the source code (here and here), it looks like the bandwidth is the standard deviation of a normal density function. Hopefully this is explained in the Kernel Smoothing book they cite.
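
To make the two readings of "bandwidth" concrete, here is a minimal base-R sketch of the weights each one implies. It assumes, as inferred above, that locpoly weights points by a normal density whose standard deviation equals bandwidth, and it assumes locfit is using a tricube kernel (which I believe is its default) that drops to zero at distance h. The variable names are mine and do not come from either package.

```r
h <- 0.1                           # same numeric value passed to both smoothers
u <- seq(-0.4, 0.4, by = 0.01)     # distance from the fitting point

# Assumed locpoly weighting: normal density with sd = h (never exactly zero)
gauss_w <- dnorm(u, mean = 0, sd = h)

# Assumed locfit weighting: tricube kernel, zero outside the half-width h
tricube_w <- ifelse(abs(u) < h, (1 - abs(u / h)^3)^3, 0)

# Share of the Gaussian weight lying outside [-h, h]: 2 * pnorm(-1), about 0.32
outside_share <- 2 * pnorm(-h, mean = 0, sd = h)

plot(u, gauss_w / max(gauss_w), type = "l", col = 3, ylab = "relative weight")
lines(u, tricube_w, col = 4)
```

Under these assumptions, roughly a third of the Gaussian weight sits beyond h, where the tricube weight is already zero, which would be consistent with locpoly looking smoother at the same nominal value.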

Upvotes: 1

znr

Reputation: 61

Both parameters control the amount of smoothing, but they do so in different ways.

locpoly's bandwidth parameter is measured on the scale of the x-axis. For example, if you change the line x <- runif(n, 0, 1) to x <- runif(n, 0, 10), you will see that the green locpoly line becomes much more squiggly, even though you still have the same number of points (100).
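
A rough way to see this scale dependence without plotting is to count how many points effectively contribute at one location. The sketch below is base R only and assumes Gaussian weights with standard deviation equal to the bandwidth; eff_n is a hypothetical helper of mine (the usual sum(w)^2 / sum(w^2) effective sample size), not something locpoly exposes.

```r
set.seed(314)
n <- 100
x1  <- runif(n, 0, 1)
x10 <- 10 * x1        # the same points with the x-axis stretched by 10
h   <- 0.05           # bandwidth held fixed in x-axis units

# Hypothetical helper: effective number of points under Gaussian weights
eff_n <- function(x, x0, h) {
  w <- dnorm(x - x0, mean = 0, sd = h)
  sum(w)^2 / sum(w^2)
}

eff_unit      <- eff_n(x1,  0.5, h)   # window covers a sizeable share of [0, 1]
eff_stretched <- eff_n(x10, 5.0, h)   # same h now covers a tenth of that share
```

With the stretched axis the same bandwidth reaches far fewer points, so each local fit rests on less data and the curve gets wigglier.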

locfit's smoothing parameter, h, is independent of the scale, and instead is based on a proportion of the data. The value 0.05 means 5% of the data that is closest to that position is used to fit the curve. So changing the scale would not alter the line.

This also explains the observation made in the comment that changing the value of h to 0.1 makes the two look nearly identical. This makes sense, because we can expect that a bandwidth of 0.05 will contain about 10% of the data if we have 100 points distributed uniformly from 0 to 1.
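
The 10% figure is easy to check numerically. This is a small base-R sketch, assuming a window that extends 0.05 on either side of an interior point (so it is not clipped at 0 or 1); the sample size is inflated well beyond the original 100 only to make the estimate stable.

```r
set.seed(314)
h  <- 0.05
x  <- runif(1e5, 0, 1)   # many uniform draws, just to reduce sampling noise
x0 <- 0.5                # interior point, so the window stays inside [0, 1]

# Fraction of points within h of x0, and the value expected for uniform data
frac_in_window <- mean(abs(x - x0) <= h)
expected_frac  <- 2 * h
```

Near the edges of the data the window is cut off, so the fraction there would be smaller than 2 * h.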

My sources include the documentation for the locfit package and the documentation for the locpoly function.

Upvotes: 3
