Praveen Gopal
Praveen Gopal

Reputation: 539

How to implement Knn-algorithm without using k-nn function in r?

Is there any link to refer? Actually i am new to r programming. I have no idea how to implement without k-nn function.I find example code only by using knn function.

Upvotes: 3

Views: 2773

Answers (1)

nathanesau
nathanesau

Reputation: 1721

I've made an example showing how the algorithm works and how to implement it in R. Here's the data for my example.

students_known <- data.frame(
      Major = c(rep("Arts",4),rep("Applied Science", 3),
                rep("Education",3), rep("Science",6)),
      Tuition = c(2000,2200,2100,1900,2800,3000,2900,
                  2500,2700,2600,3100,3200,3150,3000,
                  3175,3300),
      GPA = c(3.55,3.40,3.30,3.50,2.90,3.05,2.50,
              3.80,3.45,3.35,3.00,3.50,4.00,3.40,
              3.45,3.30),
      Age = c(20,18,22,21,24,23,21,19,19,21,20,18,17,21,24,23)
)
students_known

#              Major Tuition  GPA Age
# 1             Arts    2000 3.55  20
# 2             Arts    2200 3.40  18 
# 3             Arts    2100 3.30  22
# 4             Arts    1900 3.50  21
# 5  Applied Science    2800 2.90  24
# 6  Applied Science    3000 3.05  23
# 7  Applied Science    2900 2.50  21
# 8        Education    2500 3.80  19
# 9        Education    2700 3.45  19
# 10       Education    2600 3.35  21
# 11         Science    3100 3.00  20
# 12         Science    3200 3.50  18
# 13         Science    3150 4.00  17
# 14         Science    3000 3.40  21
# 15         Science    3175 3.45  24
# 16         Science    3300 3.30  23    

Suppose that the Major of the 3rd and 11th rows are unknown and we want to use the knn algorithm to impute the Major. In this case the class is the Major and the variables we use to compute the distance is the Tuition, GPA and Age (which are all numeric).

students_unknown <- students_known
students_unknown[3,1] <- NA
students_unknown[11,1] <- NA

I've implemented a knn algorithm in the function below. The steps of the algorithm are:

  1. Weight rows (optional). In this example, if rows aren't weighted than Tuition has a much larger effect on the distance then GPA and Age since it is so much larger.

  2. For each row with a missing value (Major), compute the total squared distance for each complete row (known Major).

  3. Select the k rows with the smallest squared distance. Choose the Major which makes up the largest proportion of these k rows.

Using this procedure, you can impute the missing values for the Major column.

# train_cols should be numeric; imp_col should represent class
knn <- function(data, imp_col, train_cols, k=5,
                weight=FALSE) {
  if(weight) {
    col_means <- sapply(data[,train_cols],mean)
    col_weights <- max(col_means)/col_means
    for(i in 1:length(train_cols)) 
      data[,train_cols[i]] <- data[,train_cols[i]]*col_weights[i]
  }

  data_complete <- data[complete.cases(data),]
  data_incomplete <- data[!complete.cases(data),]
  ncomplete <- length(data_complete[,1])
  nincomplete <- length(data_incomplete[,1])

  for(j in 1:nincomplete) {
    d <- numeric(ncomplete)
    for(i in train_cols) 
      d <- d + (data_complete[,i] - data_incomplete[j,i])^2
    # indices of k-nearest neighbors
    knn_index <- head(sort(d,index.return=T)$ix,k) 
    nn <- sort(table(data_complete[knn_index,imp_col]), T)
    data_incomplete[j,imp_col]<- names(nn)[1]
  }
  data_incomplete
}

Here's the algorithm in use.

data_actual <- students_known[c(3,11),]
data_imputed1 <- knn(students_unknown, imp_col=1, train_cols=2:4, k=3, 
                    weight=FALSE)
data_imputed2 <- knn(students_unknown, imp_col=1, train_cols=2:4, k=3,
                     weight=TRUE)

print(as.character(data_actual$Major))

# [1] "Arts"    "Science" 

print(as.character(data_imputed1$Major)) # imputes properly

# [1] "Arts"    "Science"

print(as.character(data_imputed2$Major)) # doesn't impute properly

# [1] "Arts"    "Applied Science"

Upvotes: 1

Related Questions