Reputation: 1223
I have the following data.table, where each unique x value is associated with a unique y value. Then I force one x value to NA for the purposes of the k-nearest neighbors exercise:
library(data.table)
dt <- data.table(x = rep(1:4, 3),
                 y = rep(c("Brandon", "Erica", "Karyna", "Alex"), 3))
dt[3, 1] <- NA  # blank out x in row 3 (a "Karyna" row)
print(dt)
# x y
#1: 1 Brandon
#2: 2 Erica
#3: NA Karyna
#4: 4 Alex
#5: 1 Brandon
#6: 2 Erica
#7: 3 Karyna
#8: 4 Alex
#9: 1 Brandon
#10: 2 Erica
#11: 3 Karyna
#12: 4 Alex
Following the first answer to this question, I created a binary matrix out of dt$y like so:
dt.a <- model.matrix(~ y -1 , data = dt)
dt2 <- cbind(dt[, -2, with = FALSE], dt.a)
print(dt2)
# x yAlex yBrandon yErica yKaryna
#1: 1 0 1 0 0
#2: 2 0 0 1 0
#3: NA 0 0 0 1
#4: 4 1 0 0 0
#5: 1 0 1 0 0
#6: 2 0 0 1 0
#7: 3 0 0 0 1
#8: 4 1 0 0 0
#9: 1 0 1 0 0
#10: 2 0 0 1 0
#11: 3 0 0 0 1
#12: 4 1 0 0 0
Using the knnImpute method from the preProcess function of the caret package, I would expect the centered-and-scaled value of dt3[3, 1] in the output below to equal the x values in rows 7 and 11. But it does not. In fact, it looks almost like the negative of the value in rows 7 and 11.
library(caret)
preobj <- preProcess(dt2, method = "knnImpute")
dt3 <- predict(preobj, dt2)
print(dt3)
# x yAlex yBrandon yErica yKaryna
#1: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
#2: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
#3: -0.04494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
#4: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
#5: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
#6: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
#7: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
#8: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
#9: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
#10: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
#11: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
#12: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
Shouldn't row 3 of dt3$x equal rows 7 and 11? If so, what do I need to change in my script? If not, why?
Upvotes: 3
Views: 3374
Reputation: 708
To understand what is happening, you first need to understand how the knnImpute method in the preProcess function of the caret package works. There are several flavors of k-nearest-neighbor imputation, and different software packages implement it in different ways.
The missing value can be replaced with the weighted mean, the median, or the simple mean of the k nearest neighbors, and several distance metrics can be used to find those neighbors.
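As a rough illustration of how these summaries differ, here is a minimal sketch on made-up numbers. The neighbor values and distances are hypothetical, and inverse-distance weighting is only one possible weighting scheme, assumed here for the example.

```r
# Hypothetical values of the missing variable for k = 3 neighbors,
# and their (hypothetical) distances to the row being imputed.
neighbor_values <- c(2, 4, 9)
neighbor_dists  <- c(1, 2, 4)

# Simple mean and median ignore how far away each neighbor is.
simple_mean   <- mean(neighbor_values)
simple_median <- median(neighbor_values)

# Weighted mean: closer neighbors count more.
# Inverse-distance weights are an assumption made for this sketch.
w <- 1 / neighbor_dists
weighted <- weighted.mean(neighbor_values, w)

print(c(mean = simple_mean, median = simple_median, weighted = weighted))
```

With these numbers the three summaries give three different imputed values, which is why it matters which one a package uses.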
Specific to your problem, here are the questions that arise, along with their answers.
1. How many nearest neighbors are being considered here?
The default is 5. You can change it by specifying the parameter k in the preProcess function.
2. Which distance metric is being used?
In this case, Euclidean distance is used.
3. What is the dimension of the space in which the distance is calculated, and how is it determined?
In your case it is a four-dimensional space. It is obtained by taking the columns that have no missing values, which in your case are columns 2, 3, 4, and 5.
Based on the above, if you look for the five nearest neighbors (nn) of the row containing NA among the complete rows, which are stored in preobj$data, you get the following indices (nn.idx) and corresponding distances (nn.dists):
> nn
$nn.idx
[,1] [,2] [,3] [,4] [,5]
[1,] 10 6 5 9 2
$nn.dists
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 3.126944 3.126944 3.126944
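The distances in nn.dists can be checked by hand. The sketch below rebuilds the centered-and-scaled data in base R and computes the Euclidean distance from row 3 to every complete row, using only the four dummy columns. It is an illustration of the idea, not caret's internal neighbor search, and the variable names (m, complete, query, d) are made up for this example.

```r
# Rebuild the example data without data.table or caret.
x <- rep(1:4, 3)
x[3] <- NA
y <- rep(c("Brandon", "Erica", "Karyna", "Alex"), 3)

# One-hot columns for y, then center and scale every column.
# scale() leaves NAs out when computing each column's mean and sd.
m <- scale(cbind(x = x, model.matrix(~ y - 1)))

complete <- m[!is.na(m[, "x"]), ]  # the 11 rows without NA
query    <- m[3, -1]               # row 3, observed (dummy) columns only

# Euclidean distance from the query to each complete row in the 4-D space
d <- sqrt(rowSums(sweep(complete[, -1], 2, query)^2))
print(round(sort(unname(d)), 6))
```

Only the two other "Karyna" rows are at distance 0. The remaining nine rows are all equally far away (about 3.126944), so which three of them complete the list of five neighbors depends on how the search breaks ties. In the output above they happen to be rows 5, 9, and 2 of preobj$data.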
4. Finally, how is the NA value replaced?
To replace the NA value, simply take the mean of the values in the missing column over the rows given by the nearest-neighbor indices:
> preobj$data
x yAlex yBrandon yErica yKaryna
1: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
2: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
3: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
4: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
5: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
6: 0.4494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
7: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
8: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
9: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
10: 0.4494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
11: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
> mean(preobj$data$x[nn$nn.idx])
[1] -0.04494666
And you will find that the NA is indeed replaced by this value in the output.
> dt3
x yAlex yBrandon yErica yKaryna
1: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
2: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
3: -0.04494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
4: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
5: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
6: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
7: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
8: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
9: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
10: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
11: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
12: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
Note the third row.
To replace the NA with the corresponding value of the single nearest neighbor instead, simply pass k = 1, for example preProcess(dt2, method = "knnImpute", k = 1).
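To see why k = 1 gives the value the question expected, here is a base-R sketch that mimics the two cases by hand. It does not call caret. The k = 5 case simply reuses the neighbor indices reported in nn.idx above, and the names imputed_k1 and imputed_k5 are made up for this example.

```r
# Same scaled data as in the question, rebuilt in base R.
x <- rep(1:4, 3)
x[3] <- NA
y <- rep(c("Brandon", "Erica", "Karyna", "Alex"), 3)
m <- scale(cbind(x = x, model.matrix(~ y - 1)))

complete <- m[!is.na(m[, "x"]), ]  # rows without NA
d <- sqrt(rowSums(sweep(complete[, -1], 2, m[3, -1])^2))

# k = 1: copy x from the single closest row (a "Karyna" row at distance 0)
imputed_k1 <- unname(complete[which.min(d), "x"])

# k = 5: mean of x over the neighbor indices reported in nn.idx
imputed_k5 <- mean(complete[c(10, 6, 5, 9, 2), "x"])

print(c(k1 = imputed_k1, k5 = imputed_k5))
```

With k = 1 the imputed value is the scaled x of a "Karyna" row, the same value as in rows 7 and 11 of dt3. With k = 5 the three tied "Erica" rows pull the mean down to the value seen in row 3.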
Upvotes: 6