Marios D. Lokas

Reputation: 55

Interpret knn.cv (R) results after applying on data set

I have run into a problem using the k-nearest neighbors algorithm with cross-validation in R, via knn.cv from the FNN package. The data set consists of 4601 emails with 58 attributes: the first 57 are numerical character or word frequencies in the email (range [0, 100]), and the last one indicates whether the email is spam (value 1) or ham (value 0).

After specifying the train and cl arguments and using 10 neighbors, running the function prints a list of all the emails with values like 7.4032 in each column, which I don't know how to use. I need to find the percentages of spam and ham that the classifier predicts and compare them with the true percentages. How should I interpret these results?

Upvotes: 3

Views: 6430

Answers (1)

joran

Reputation: 173677

Given that the data set you describe matches (exactly) the spam data set in the ElemStatLearn package accompanying the well-known book by the same title, I'm wondering if this is in fact a homework assignment. If that's the case, it's ok, but you should add the homework tag to your question.

Here are some pointers.

The documentation for the function knn.cv says that it returns a vector of classifications, along with the distances and indices of the k nearest neighbors as "attributes". So when I run this:

library(FNN)
out <- knn.cv(spam[, -58], spam[, 58], k = 10)

The object out looks sort of like this:

> head(out)
[1] spam  spam  spam  spam  spam  email
Levels: email spam

The other values you refer to are sort of "hidden" as attributes, but you can see that they are there using str:

> str(out)
 Factor w/ 2 levels "email","spam": 2 2 2 2 2 1 1 1 2 2 ...
 - attr(*, "nn.index")= int [1:4601, 1:10] 446 1449 500 5 4 4338 2550 4383 1470 53 ...
 - attr(*, "nn.dist")= num [1:4601, 1:10] 8.10e-01 2.89 1.50e+02 2.83e-03 2.83e-03 ...

You can access those additional attributes via something like this:

nn.index <- attr(out,'nn.index')
nn.dist <- attr(out,'nn.dist')

Note that both of these objects are matrices of dimension 4601 x 10, which makes sense: the documentation says they record the indices (i.e. row numbers) of the k = 10 nearest neighbors and the distances to each.
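As a rough illustration of what the index matrix is good for, the sketch below looks up the class of each neighbor and computes the share of neighbors labelled spam for every observation. The vector cl and the small nn.index matrix are made-up toy stand-ins (5 observations, k = 2), not the real spam data; with the real output you would use spam[, 58] and attr(out, 'nn.index') instead.

```r
# Hypothetical toy labels standing in for spam[, 58].
cl <- factor(c("spam", "spam", "email", "email", "spam"),
             levels = c("email", "spam"))

# Hypothetical stand-in for attr(out, "nn.index"): row i holds the row
# numbers of the nearest neighbors of observation i.
nn.index <- matrix(c(2, 5,
                     1, 5,
                     4, 1,
                     3, 2,
                     1, 2),
                   ncol = 2, byrow = TRUE)

# Look up the class of every neighbor, keeping the n x k shape.
nn.class <- matrix(as.character(cl)[nn.index], nrow = nrow(nn.index))

# Share of each observation's neighbors that are labelled spam.
spam.share <- rowMeans(nn.class == "spam")
```

A share above 0.5 corresponds to a majority vote for spam among the neighbors; how ties are broken is up to the classifier, so this share need not reproduce the returned factor exactly in tied cases.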

For the last bit, i.e. comparing the predicted percentages of spam and ham with the true ones, you will probably find the table() function useful, as well as prop.table().
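A minimal sketch of that comparison, using only base R. The vectors truth and pred are small made-up examples; in your case truth would be spam[, 58] and pred would be the factor returned by knn.cv (out above).

```r
# Hypothetical toy labels: truth stands in for spam[, 58], pred for the
# factor returned by knn.cv.
truth <- factor(c("spam", "spam", "email", "email", "email", "spam"),
                levels = c("email", "spam"))
pred  <- factor(c("spam", "email", "email", "email", "spam", "spam"),
                levels = c("email", "spam"))

# Class percentages: predicted versus actual.
pred.pct  <- 100 * prop.table(table(pred))
truth.pct <- 100 * prop.table(table(truth))

# Confusion matrix: rows are the true class, columns the prediction.
confusion <- table(truth = truth, predicted = pred)

# Overall proportion of observations classified correctly.
accuracy <- mean(pred == truth)
```

Note that matching class percentages do not by themselves mean the classifier is accurate, since errors in the two directions can cancel out; the confusion matrix shows where the misclassifications actually occur.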

Upvotes: 4
