Reputation: 101
I'm taking my first steps with R and am now testing the KNN classification method (package class), but I'm struggling to get it working.
I have two DocumentTermMatrix objects, one for training and another for testing.
https://www.dropbox.com/s/218veow5tqrhlcw/train_test_matrix.png
I think I'm doing everything right.
## Test KNN Classification
train = dtm_control_tfidf_treino # train set from 1:7
test = dtm_control_tfidf_teste # test set from 8:10
cl = factor(dtm_control_tfidf_treino$class[1:7])
x = knn(train, test, cl, k = 3, prob = TRUE)
attributes(.Last.value)
I'm getting this error:
> x = knn(train, test, cl, k = 3, prob = TRUE)
Error in knn(train, test, cl, k = 3, prob = TRUE) :
'train' and 'class' have different lengths
I really don't understand how to make this work. If someone could give me some hints on how to do this properly, that would be nice.
If you need more data or anything else, just ask.
Upvotes: 1
Views: 626
Reputation: 206167
If you subset the corpus first, each DTM is built from different documents and so ends up with a different set of terms. That is not what you want: the train and test sets need to share a common term list. Instead, build the DTM from all documents, then subset the DTM to make the train and test sets. Here's an example using a built-in data set.
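library(tm)     # VCorpus, DirSource, DocumentTermMatrix
library(class)  # knn
# Sample Reuters documents that ship with tm
reut21578 <- system.file("texts", "crude", package = "tm")
cc <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
# Build one DTM from all documents so train and test share the same terms
dtm <- DocumentTermMatrix(cc)
train <- dtm[1:7, ]
test <- dtm[8:10, ]
# Placeholder labels, one per training document
knn(train, test, factor(letters[1:7]), k = 3, prob = TRUE)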
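The error message in the question also points at the labels: knn() needs cl to hold exactly one label per row of train. Here is a minimal sketch of that check, assuming a made-up count matrix in place of a real DTM. The term names, the counts, and the "a"/"b" labels are all hypothetical and only there for illustration.

```r
library(class)  # knn(); a recommended package normally bundled with R

# Made-up document-term counts: 10 documents, 4 terms (illustration only)
set.seed(42)
m <- matrix(rpois(40, lambda = 2), nrow = 10,
            dimnames = list(paste0("doc", 1:10),
                            c("oil", "price", "market", "barrel")))

# Split AFTER building the matrix so both parts share the same columns
train <- m[1:7, ]
test  <- m[8:10, ]

# Hypothetical class labels: exactly one per training row
cl <- factor(c("a", "a", "b", "b", "a", "b", "a"))

# Check the two conditions before calling knn()
stopifnot(length(cl) == nrow(train))
stopifnot(identical(colnames(train), colnames(test)))

pred <- knn(train, test, cl, k = 3, prob = TRUE)
length(pred)  # one prediction per test row
```

If the first stopifnot() fails on your own data, check how the label vector is built before looking at the matrices.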
Upvotes: 1