Robert

Reputation: 13

R - Tokenization - single and two letter words in a TermDocumentMatrix

I am doing some text processing and would like one- and two-letter words to be included in a TermDocumentMatrix.

The issue is that it seems to keep only words of three or more letters.

    library(tm)
    library(RWeka)

    test <- 'This is a test.'

    testmyCorpus <- Corpus(VectorSource(test))
    testTDF <- TermDocumentMatrix(testmyCorpus, control = list(tokenize = AlphabeticTokenizer))
    inspect(testTDF)

Only the words "this" and "test" are displayed. Any ideas?

Thanks a lot for your help! Robert

Upvotes: 1

Views: 1635

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

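This has already been answered for an almost identical problem: in short, add the option control=list(wordLengths=c(1,Inf)) to TermDocumentMatrix. By default, tm keeps only terms of at least three characters (wordLengths=c(3,Inf)), which is why "is" and "a" are dropped.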
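
A minimal sketch of that suggestion, applied to the example sentence from the question. It leaves out RWeka and relies on tm's default tokenizer, and it adds removePunctuation=TRUE so that "test." is counted as "test"; both of those choices are assumptions made for this illustration, not part of the original answer.

    library(tm)

    test <- "This is a test."
    corpus <- Corpus(VectorSource(test))

    # wordLengths = c(1, Inf) lowers the minimum term length from the
    # default of 3 characters to 1, so short words are no longer dropped.
    # removePunctuation = TRUE is an extra assumption for this sketch.
    tdm <- TermDocumentMatrix(
      corpus,
      control = list(wordLengths = c(1, Inf), removePunctuation = TRUE)
    )

    inspect(tdm)
    Terms(tdm)  # lists the terms that were kept

If you want to keep the tokenizer from the question, the same wordLengths entry can sit next to it in the control list, for example control=list(tokenize=AlphabeticTokenizer, wordLengths=c(1,Inf)).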

Upvotes: 2
