jeza
jeza

Reputation: 299

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.

The KNN algorithm in R uses the Euclidian distance by default. So I wrote my own one. I would like to find the number of correct class label matches between the nearest neighbor and target.

I have prepared the data at first. Then I called the data (wdbc_n), I chose K=1. I have used Euclidian distance as a test.

library(philentropy)
knn <- function(xmat, k,method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist<- distance(xmat, method)  
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")

Hoping to get a similar result to the paper ("on the surprising behavior of distance metrics in high dimensional space") (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).

My question is

Am I right or wrong with the codes?

Any suggestions or reference that will guide me will be highly appreciated.

EDIT

My data (breast-cancer-wisconsin)(wdbc) dimension is

569  32

After normalizing and removing the id and target column the dimension is

dim(wdbc_n)
569  30

The train and test split is given by

wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]

Upvotes: 2

Views: 4570

Answers (1)

GuillaumeL
GuillaumeL

Reputation: 1015

Am I right or wrong with the codes?

Your code is wrong.

The call to the distance function taked about 3 seconds every time on my rather recent PC so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:

ddist<- distance(xmat, method)  

Each loop feeds the whole xmat matrix at the distance function, then uses only the first line from the resulting matrix. This calculates the distance between the training set rows, and does that n times, discarding every row except the first. Which is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance with each row in the training set.

Let's take a look at the documentation for the distance function:

distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)

x a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).

(...)

in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.

In your specific case (knn classification), you want to use the 2 row version.

One last thing: you used order, which will return the position of the k largest distances in the ddist vector. I think what you want is the distances themselves, so you need to use sort instead of order.

Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.

Standalone working solution(s)

library(philentropy)
normalize <- function(x) {
 return ((x - min(x)) / (max(x) - min(x)))
}

knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k) 
  ddist <- NULL
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      xmat <- rbind(test[i,], train[j,]) #we make a 2 row matrix combining the current test and train rows
      ddist[j] <- distance(as.data.frame(xmat), method, k)  #then we calculate the distance and append it to the ddist vector.
    }
    neigh[i, ] <- sort(ddist)[2:(k + 1)] 
  }
  return(neigh)
}

wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:549,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")

Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.

Now does that work?

The two first test rows using the custom knn function:

          [,1]      [,2]      [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369

Let us compare with the equivalent function in the FNN package:

library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist

          [,1]      [,2]      [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161

Conclusion: not too shabby.

Upvotes: 1

Related Questions