andrealmeida

Reputation: 101

R KNN categorization with DocumentTermMatrixes

I'm taking my first steps with R and am now testing the KNN classification method (package class), but I'm struggling to get it working.

I have two DocumentTermMatrix objects, one for training and another for testing.

https://www.dropbox.com/s/218veow5tqrhlcw/train_test_matrix.png

I think I'm doing everything right.

## Test KNN Classification
train = dtm_control_tfidf_treino # train set from 1:7
test = dtm_control_tfidf_teste   # test set from 8:10
cl = factor(dtm_control_tfidf_treino$class[1:7])
x = knn(train, test, cl, k = 3, prob = TRUE)
attributes(.Last.value)

I'm getting this error:

> x = knn(train, test, cl, k = 3, prob = TRUE)
Error in knn(train, test, cl, k = 3, prob = TRUE) : 
'train' and 'class' have different lengths

I really don't understand how to make this work. If someone could give me some hints on how to do this properly, that would be nice.

If you need more data or anything else, just ask.

Upvotes: 1

Views: 626

Answers (1)

MrFlick

Reputation: 206167

If you subset the corpus and build a separate DTM from each part, each DTM will have its own set of terms (columns). This is not what you want: the training and test matrices need to share a common term list. So instead, build the DTM from all documents, then subset the DTM rows to make the train/test sets. Here's an example using a built-in data set.

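library(tm)     # VCorpus, DirSource, DocumentTermMatrix
library(class)  # knn

reut21578 <- system.file("texts", "crude", package = "tm")
cc <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

# Build one DTM from all documents so every row shares the same terms
dtm <- DocumentTermMatrix(cc)

# Split the DTM (not the corpus) into training and test rows
train <- dtm[1:7, ]
test <- dtm[8:10, ]

# Placeholder labels: one per training document
knn(train, test, factor(letters[1:7]), k = 3, prob = TRUE)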
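
The error in the question is raised whenever length(cl) differs from the number of rows in train, so it helps to check that explicitly before calling knn. Below is a rough sketch of the same split-after-building idea without tm, assuming a plain numeric matrix stands in for the DTM. The counts, term names, and the "econ"/"sport" labels are made up for illustration; they are not from the question's data.

```r
library(class)  # provides knn()

# Toy document-term counts: 10 documents (rows) x 4 terms (columns).
# All values are hypothetical stand-ins for a real DTM.
m <- matrix(c(
  5, 4, 0, 0,
  4, 5, 1, 0,
  6, 3, 0, 1,
  5, 5, 0, 0,
  0, 1, 5, 4,
  1, 0, 4, 5,
  0, 0, 6, 5,
  5, 4, 1, 0,
  0, 1, 5, 5,
  4, 6, 0, 1
), ncol = 4, byrow = TRUE,
dimnames = list(paste0("doc", 1:10), c("oil", "price", "goal", "match")))

# One (hypothetical) class label per document, in row order
labels <- factor(c("econ", "econ", "econ", "econ", "sport",
                   "sport", "sport", "econ", "sport", "econ"))

train_idx <- 1:7
test_idx  <- 8:10

# Subset rows of the single shared matrix, so the columns stay identical
train <- m[train_idx, , drop = FALSE]
test  <- m[test_idx, , drop = FALSE]
cl    <- labels[train_idx]

# knn() stops with "'train' and 'class' have different lengths"
# unless these two numbers match
stopifnot(length(cl) == nrow(train))
stopifnot(identical(colnames(train), colnames(test)))

pred <- knn(train, test, cl, k = 3, prob = TRUE)
print(pred)
print(table(predicted = pred, actual = labels[test_idx]))
```

With a real DocumentTermMatrix the labels would come from wherever the document classes are stored (for example a separate vector kept in the same order as the documents), indexed by the same row numbers used for train.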

Upvotes: 1
