Reputation: 539
Is there any link to refer? Actually i am new to r programming. I have no idea how to implement without k-nn function.I find example code only by using knn function.
Upvotes: 3
Views: 2773
Reputation: 1721
I've made an example showing how the algorithm works and how to implement it in R. Here's the data for my example.
students_known <- data.frame(
Major = c(rep("Arts",4),rep("Applied Science", 3),
rep("Education",3), rep("Science",6)),
Tuition = c(2000,2200,2100,1900,2800,3000,2900,
2500,2700,2600,3100,3200,3150,3000,
3175,3300),
GPA = c(3.55,3.40,3.30,3.50,2.90,3.05,2.50,
3.80,3.45,3.35,3.00,3.50,4.00,3.40,
3.45,3.30),
Age = c(20,18,22,21,24,23,21,19,19,21,20,18,17,21,24,23)
)
students_known
# Major Tuition GPA Age
# 1 Arts 2000 3.55 20
# 2 Arts 2200 3.40 18
# 3 Arts 2100 3.30 22
# 4 Arts 1900 3.50 21
# 5 Applied Science 2800 2.90 24
# 6 Applied Science 3000 3.05 23
# 7 Applied Science 2900 2.50 21
# 8 Education 2500 3.80 19
# 9 Education 2700 3.45 19
# 10 Education 2600 3.35 21
# 11 Science 3100 3.00 20
# 12 Science 3200 3.50 18
# 13 Science 3150 4.00 17
# 14 Science 3000 3.40 21
# 15 Science 3175 3.45 24
# 16 Science 3300 3.30 23
Suppose that the Major of the 3rd and 11th rows are unknown and we want to use the knn algorithm to impute the Major. In this case the class is the Major and the variables we use to compute the distance is the Tuition, GPA and Age (which are all numeric).
students_unknown <- students_known
students_unknown[3,1] <- NA
students_unknown[11,1] <- NA
I've implemented a knn algorithm in the function below. The steps of the algorithm are:
Weight rows (optional). In this example, if rows aren't weighted than Tuition has a much larger effect on the distance then GPA and Age since it is so much larger.
For each row with a missing value (Major), compute the total squared distance for each complete row (known Major).
Select the k rows with the smallest squared distance. Choose the Major which makes up the largest proportion of these k rows.
Using this procedure, you can impute the missing values for the Major column.
# train_cols should be numeric; imp_col should represent class
knn <- function(data, imp_col, train_cols, k=5,
weight=FALSE) {
if(weight) {
col_means <- sapply(data[,train_cols],mean)
col_weights <- max(col_means)/col_means
for(i in 1:length(train_cols))
data[,train_cols[i]] <- data[,train_cols[i]]*col_weights[i]
}
data_complete <- data[complete.cases(data),]
data_incomplete <- data[!complete.cases(data),]
ncomplete <- length(data_complete[,1])
nincomplete <- length(data_incomplete[,1])
for(j in 1:nincomplete) {
d <- numeric(ncomplete)
for(i in train_cols)
d <- d + (data_complete[,i] - data_incomplete[j,i])^2
# indices of k-nearest neighbors
knn_index <- head(sort(d,index.return=T)$ix,k)
nn <- sort(table(data_complete[knn_index,imp_col]), T)
data_incomplete[j,imp_col]<- names(nn)[1]
}
data_incomplete
}
Here's the algorithm in use.
data_actual <- students_known[c(3,11),]
data_imputed1 <- knn(students_unknown, imp_col=1, train_cols=2:4, k=3,
weight=FALSE)
data_imputed2 <- knn(students_unknown, imp_col=1, train_cols=2:4, k=3,
weight=TRUE)
print(as.character(data_actual$Major))
# [1] "Arts" "Science"
print(as.character(data_imputed1$Major)) # imputes properly
# [1] "Arts" "Science"
print(as.character(data_imputed2$Major)) # doesn't impute properly
# [1] "Arts" "Applied Science"
Upvotes: 1