Reputation: 326
I would like to understand how the R kknn package calculates weights, distances, and class probabilities for binary classification problems. In the R code below, there are three observations in the training sample and one observation in the holdout sample. The two predictor variables are height and weight. With Euclidean distance, the distances for each observation in the training sample are then:
sqrt((6-8)^2 + (4-5)^2) = 2.24
sqrt((6-3)^2 + (4-7)^2) = 4.24
sqrt((6-7)^2 + (4-3)^2) = 1.41.
With k=3 and with equal weights, I get a probability for the holdout as:
(1/3 * 1) + (1/3 * 0) + (1/3 * 1) = 0.67.
With k=2 and with equal weights, I get a probability for the holdout as:
(1/2 * 1) + (1/2 * 1) = 1.00.
I would like to understand how the R kknn package makes these same calculations with the "triangular," "gaussian," and "inverse" weights (and more generally).
library(kknn)
training <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))
holdouts <- data.frame(class = 1, height = 6, weight = 4)
triangular_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "triangular", k = 3)
triangular_kernel[["fitted.values"]]
triangular_kernel[["W"]]
triangular_kernel[["D"]]
gaussian_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "gaussian", k = 3)
gaussian_kernel[["fitted.values"]]
gaussian_kernel[["W"]]
gaussian_kernel[["D"]]
inverse_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "inv", k = 3)
inverse_kernel[["fitted.values"]]
inverse_kernel[["W"]]
inverse_kernel[["D"]]
Upvotes: 0
Views: 1450
Reputation: 4419
Calling kknn::kknn
prints the source code for the kknn function in the console. With it, one can go through the function line by line to see what it does.
kknn
calls a compiled C code dmEuclid
. To obtain its source code, we follow this guide, writing the following code in R:
untar(download.packages(pkgs = "kknn", destdir = ".", type = "source")[,2])
and then open the src directory of kknn_1.3.1.tar in your working directory (getwd()
) to find and open dm.C using any text editor. Scroll about halfway to find dmEuclid. To test the exact outputs of dmEuclid
, you could install the build tools, and open a C++ file in Rstudio by selecting it in the dropdown menu, and run the code with different inputs.
Following the function outputs, in your case the dmtmp$dm
results in
3.779645e-01 1.133893e+00 1.000000e+150 3.685210e-156
Per your specification k
, the first 3 values are chosen as distance D
.
This is manually converted to maxdist = 1e-06
by the package author, as the max distance is smaller than that in your case.
The kknn
function uses the following section to allocate a weight scheme, per your defined kernel.
W <- D/maxdist
W <- pmin(W, 1 - (1e-06))
W <- pmax(W, 1e-06)
at this point your W values are larger than 1, and so W is then coerced to approximately 1.
if (kernel == "inv"
W <- 1/W
if (kernel == "triangular")
W <- 1 - W
if (kernel == "gaussian") {
alpha = 1/(2 * (k + 1))
qua = abs(qnorm(alpha))
W = W * qua
W = dnorm(W, sd = 1)
}
the explanation for which can be found in the paper linked by gowerc.
W is then converted to matrix W <- matrix(W, p, k)
with 1 row (p=1), 3 columns (k=3)
p = 1
in your case is 1, k=3
, cl = c(1,0,1)
.
C <- matrix(dmtmp$cl, nrow = p, ncol = k + 1)
C <- C[, 1:k] + 1
CL <- matrix(cl[C], nrow = p, ncol = k)
W <- matrix(W, p, k)
fit <- rowSums(W * CL)/pmax(rowSums(W), 1e-06)
Upvotes: 5