Reputation: 55
I have encountered a problem while using the k-nearest neighbors algorithm (with cross-validation) on a data set in R, specifically knn.cv
from the FNN package.
The data set consists of 4601 email cases with 58 attributes: the first 57 are character or word frequencies in the emails (numerical, range [0, 100]),
and the last one indicates whether the email is spam (value 1) or ham (value 0).
After specifying the train and cl arguments and using 10 neighbors, running the function produces a list of all the emails with values like 7.4032
in each column, which I don't know how to use. I need to find the percentage of spam and ham that the package classifies and compare it with the correct percentages. How should I interpret these results?
Upvotes: 3
Views: 6430
Reputation: 173677
Given that the data set you describe matches (exactly) the spam data set in the ElemStatLearn package accompanying the well-known book by the same title, I'm wondering if this is in fact a homework assignment. If that's the case, it's ok, but you should add the homework tag to your question.
Here are some pointers.
The documentation for the function knn.cv
says that it returns a vector of classifications, with the distances and indices of the k nearest neighbors attached as "attributes". So when I run this:
library(FNN)
data(spam, package = "ElemStatLearn")
out <- knn.cv(spam[, -58], spam[, 58], k = 10)
The object out
looks sort of like this:
> head(out)
[1] spam spam spam spam spam email
Levels: email spam
The other values you refer to are sort of "hidden" as attributes, but you can see that they are there using str:
> str(out)
Factor w/ 2 levels "email","spam": 2 2 2 2 2 1 1 1 2 2 ...
- attr(*, "nn.index")= int [1:4601, 1:10] 446 1449 500 5 4 4338 2550 4383 1470 53 ...
- attr(*, "nn.dist")= num [1:4601, 1:10] 8.10e-01 2.89 1.50e+02 2.83e-03 2.83e-03 ...
You can access those additional attributes via something like this:
nn.index <- attr(out,'nn.index')
nn.dist <- attr(out,'nn.dist')
Note that both of these objects are matrices of dimension 4601 x 10, which makes sense: the documentation says they record the index (i.e., row number) of each of the k = 10 nearest neighbors, as well as the distance to each.
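As a quick sanity check (a sketch, assuming out, nn.index, and nn.dist were created as above), you can inspect the neighbors of a single case:

```r
# Row 1 of nn.index gives the row numbers of the 10 nearest
# neighbors of email 1; nn.dist gives the corresponding distances
nn.index[1, ]
nn.dist[1, ]

# Both matrices should have one row per case and one column per neighbor
dim(nn.index)  # 4601 x 10
```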
For the last bit, you will probably find the table()
function useful, as well as prop.table().
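Continuing from the knn.cv call above, a minimal sketch (assuming the spam data frame from ElemStatLearn is loaded and out exists):

```r
# Proportion of emails that knn.cv classifies into each class
prop.table(table(out))

# Cross-tabulate predictions against the true labels in column 58
conf <- table(predicted = out, actual = spam[, 58])
conf

# Overall accuracy: fraction of cases on the diagonal of the table
sum(diag(conf)) / sum(conf)
```

prop.table(table(out)) gives the classified spam/ham percentages you asked about, and the cross-tabulation lets you compare them against the true labels.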
Upvotes: 4