bshelt141

Reputation: 1223

knnImpute using categorical variables with caret package

I have the following data.table, where each unique x value is associated with a unique y value. Then I force one x value as NA for purposes of the k-nearest neighbors exercise:

library(data.table)

dt <- data.table(x = rep(c(1:4), 3), 
                 y = rep(c("Brandon", "Erica", "Karyna", "Alex"), 3))
dt[3, 1] <- NA  # blank out one x value so there is something to impute

print(dt)
 #    x       y
 #1:  1 Brandon
 #2:  2   Erica
 #3: NA  Karyna
 #4:  4    Alex
 #5:  1 Brandon
 #6:  2   Erica
 #7:  3  Karyna
 #8:  4    Alex
 #9:  1 Brandon
#10:  2   Erica
#11:  3  Karyna
#12:  4    Alex

Following the first answer to this question, I created a binary (dummy-variable) matrix out of dt$y like so:

dt.a <- model.matrix(~ y -1 , data = dt)
dt2 <- cbind(dt[, -2, with = FALSE], dt.a)

print(dt2)
 #    x yAlex yBrandon yErica yKaryna
 #1:  1     0        1      0       0
 #2:  2     0        0      1       0
 #3: NA     0        0      0       1
 #4:  4     1        0      0       0
 #5:  1     0        1      0       0
 #6:  2     0        0      1       0
 #7:  3     0        0      0       1
 #8:  4     1        0      0       0
 #9:  1     0        1      0       0
#10:  2     0        0      1       0
#11:  3     0        0      0       1
#12:  4     1        0      0       0

Using the knnImpute method of the preProcess function in the caret package, I would expect the centered-and-scaled value of dt3[3, 1] in the output below to equal the x values in rows 7 and 11. But it does not. In fact, it looks to be almost the negative of the value in rows 7 and 11.

library(caret)

preobj <- preProcess(dt2, method = "knnImpute")
dt3 <- predict(preobj, dt2)

print(dt3)
 #             x      yAlex   yBrandon     yErica    yKaryna
 #1: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 #2: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 #3: -0.04494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
 #4:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 #5: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 #6: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 #7:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
 #8:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 #9: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
#10: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
#11:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
#12:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708

Shouldn't dt3$x's row 3 equal rows 7 and 11? If so, what do I need to change in my script? If not, why?

Upvotes: 3

Views: 3374

Answers (1)

9Heads

Reputation: 708

To understand what is happening, you first need to understand how the knnImpute method in caret's preProcess function works. k-nearest-neighbor imputation comes in several flavors, and different software packages implement it in different ways.

The missing value can be replaced by the mean, the weighted mean, or the median of the k nearest neighbors, and several distance metrics can be used to find those neighbors.

For your specific problem, here are the questions that arise, along with their answers.

1. How many nearest neighbors are being considered here?

The default is 5. You can change it with the k argument of preProcess.

2. Which distance metric is being used?

In this case, Euclidean distance.

3. What is the dimension of the space in which distances are calculated, and how is it determined?

In your case it is a four-dimensional space. It is built from the columns that have no missing value in the row being imputed, which here are columns 2, 3, 4, and 5 (the dummy variables).

Based on the above, if you search for the five nearest neighbors ( nn ) of the row containing the NA among the complete rows, which are stored in preobj$data , you get the following indices ( nn.idx ) and corresponding distances ( nn.dists ):

> nn
$nn.idx
     [,1] [,2] [,3] [,4] [,5]
[1,]   10    6    5    9    2

$nn.dists
     [,1] [,2]     [,3]     [,4]     [,5]
[1,]    0    0 3.126944 3.126944 3.126944
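The answer does not show how nn was produced, so here is a minimal base-R sketch of the same search. It is an illustration under stated assumptions, not caret's internal code: the variable names are made up, and scale() stands in for the centering and scaling step. One thing it makes visible is that only two rows sit at distance 0, while the nine remaining rows are all tied at the same distance. Which three of those fill the last slots depends on the search implementation, so the sketch reuses the indices printed above rather than picking its own.

```r
# Rebuild the question's data with base R only (no data.table or caret).
x <- rep(1:4, 3)
y <- rep(c("Brandon", "Erica", "Karyna", "Alex"), 3)
x[3] <- NA
dat <- cbind(x = x, model.matrix(~ y - 1))

# Center and scale each column; scale() ignores the NA when computing
# the column mean and standard deviation.
scaled <- scale(dat)

# Complete rows (the analogue of preobj$data) and the row to impute.
complete <- scaled[complete.cases(scaled), , drop = FALSE]
target   <- scaled[3, -1]   # dummy columns only, since x is missing

# Euclidean distance from the target to every complete row,
# measured in the four dummy-variable columns.
d <- sqrt(rowSums(sweep(complete[, -1, drop = FALSE], 2, target)^2))
print(round(unname(d), 6))

# Neighbor indices copied from the nn output in the answer.
nn_idx  <- c(10, 6, 5, 9, 2)
imputed <- mean(complete[nn_idx, "x"])
print(imputed)
```

The two zero distances belong to the other Karyna rows; every other row is equally far away, which is why the three extra neighbors are effectively arbitrary.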

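4. Finally, how is the NA value replaced?

It is replaced by the mean of the missing column's values over the nearest-neighbor rows. Only two of the five neighbors (rows 10 and 6 of preobj$data, the other Karyna rows) are at distance 0. The remaining three (rows 5, 9, and 2) are Erica rows, and they pull the mean away from the Karyna value of 0.4494666.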

> preobj$data
             x      yAlex   yBrandon     yErica    yKaryna
 1: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 2: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
 3:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708
 4: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 5: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
 6:  0.4494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
 7:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708
 8: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 9: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
10:  0.4494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
11:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708

> mean(preobj$data$x[nn$nn.idx])
[1] -0.04494666

And you will find that indeed the NA is replaced by this value in the output.

> dt3
              x      yAlex   yBrandon     yErica    yKaryna
 1: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 2: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 3: -0.04494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
 4:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 5: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 6: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 7:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
 8:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 9: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
10: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
11:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
12:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708

Note the third row.

To replace the NA with the value of the single nearest neighbor instead, use k = 1.
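As a rough sketch of what that could look like, the snippet below reruns the imputation with k = 1 and then undoes the centering and scaling so that x is back on its original 1 to 4 scale. It assumes the caret and RANN packages are installed (the block is skipped otherwise), uses a plain data.frame instead of the question's data.table, and relies on the mean and std elements of the preProcess object for the back-transformation. The names df, df2, pre1, and x_original_scale are illustrative only.

```r
# Assumption: caret and RANN are installed; otherwise nothing is run.
have_pkgs <- requireNamespace("caret", quietly = TRUE) &&
  requireNamespace("RANN", quietly = TRUE)

if (have_pkgs) {
  df <- data.frame(
    x = rep(1:4, 3),
    y = rep(c("Brandon", "Erica", "Karyna", "Alex"), 3)
  )
  df$x[3] <- NA
  df2 <- data.frame(x = df$x, model.matrix(~ y - 1, data = df))

  # k = 1: take x from the single closest row in dummy-variable space.
  pre1 <- caret::preProcess(df2, method = "knnImpute", k = 1)
  out  <- predict(pre1, df2)

  # knnImpute centers and scales the data; reverse that for x.
  x_original_scale <- out$x * pre1$std[["x"]] + pre1$mean[["x"]]
  print(x_original_scale)
}
```

With k = 1 the only candidates at distance 0 are the two other Karyna rows, so the third entry should come back equal to the x value those rows share.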

Upvotes: 6
