R - Tokenization - single and two letter words in a TermDocumentMatrix

Question

I am doing some text processing and would like one- and two-letter words to show up in a TermDocumentMatrix.

The issue is that it seems to include only words of three letters or more.

    library(tm)
    library(RWeka)

    test<-'This is a test.'

    testmyCorpus<-Corpus(VectorSource(test))
    testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
    inspect(testTDF)

Only the words "this" and "test" are displayed. Any ideas?

Thanks a lot for your help! Robert

Nikita Astrakhantsev · Accepted Answer

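An almost identical problem has already been answered. In short, add the option wordLengths=c(1,Inf) to the control list you pass to TermDocumentMatrix, i.e. control=list(wordLengths=c(1,Inf)). By default tm keeps only terms that are at least three characters long (wordLengths=c(3,Inf)), which is why "is" and "a" are dropped.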
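To make that concrete, here is a minimal sketch of the fix applied to the example from the question. It makes two assumptions that are not in the original post: it uses VCorpus instead of Corpus, and it swaps RWeka's AlphabeticTokenizer for tm's default tokenizer plus removePunctuation=TRUE, so that it runs without RWeka and Java. The same wordLengths entry can be added to the original control list next to tokenize=AlphabeticTokenizer.

```r
library(tm)

test <- 'This is a test.'

# Assumption: VCorpus rather than Corpus, so the control options are
# handled by termFreq() regardless of the installed tm version.
testmyCorpus <- VCorpus(VectorSource(test))

# wordLengths = c(1, Inf) lowers the minimum term length from the
# default of 3 characters to 1, so "a" and "is" are no longer dropped.
# removePunctuation = TRUE stands in for AlphabeticTokenizer here
# (a substitution made only for this sketch).
testTDF <- TermDocumentMatrix(
  testmyCorpus,
  control = list(removePunctuation = TRUE,
                 wordLengths = c(1, Inf))
)

inspect(testTDF)

# Terms() returns the row names of the matrix, i.e. the kept terms.
terms <- Terms(testTDF)
print(terms)
```

With the lower bound set to 1, the one- and two-letter words should appear in the matrix alongside "this" and "test".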
