Reputation: 13
I am doing some text processing and would like to include one- and two-letter words in a TermDocumentMatrix.
The issue is that it seems to keep only words of three or more letters.
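library(tm)
library(RWeka)
test <- 'This is a test.'
# Build a one-document corpus from the test string
testmyCorpus <- Corpus(VectorSource(test))
# Tokenize with RWeka's AlphabeticTokenizer (letters only)
testTDF <- TermDocumentMatrix(testmyCorpus, control = list(tokenize = AlphabeticTokenizer))
inspect(testTDF)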
Only the words "this" and "test" are displayed. Any ideas?
Thanks a lot for your help! Robert
Upvotes: 1
Views: 1635
Reputation: 4749
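The answer to a nearly identical problem applies here: by default, tm drops terms shorter than three characters (wordLengths=c(3,Inf)). In short, add the option wordLengths=c(1,Inf) to the control list passed to TermDocumentMatrix, i.e. control=list(wordLengths=c(1,Inf)).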
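As a minimal sketch of that suggestion, the snippet below rebuilds the matrix with wordLengths set so that short terms are kept. It makes two assumptions that are not in the question: it uses VCorpus rather than Corpus (in recent tm versions Corpus may return a SimpleCorpus, which handles control options differently), and it swaps the RWeka tokenizer for tm's default tokenizer plus removePunctuation so that it runs without Java.

```r
library(tm)

test <- "This is a test."

# Assumption: VCorpus is used here instead of Corpus so that the
# control options are passed through tm's standard term-frequency path.
corpus <- VCorpus(VectorSource(test))

# wordLengths = c(1, Inf) lowers the minimum term length from the
# default of 3 to 1, so one- and two-letter words are retained.
# removePunctuation strips the trailing period from "test."
tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    removePunctuation = TRUE,
    wordLengths = c(1, Inf)
  )
)

inspect(tdm)

# List the terms that made it into the matrix
terms_found <- Terms(tdm)
print(terms_found)
```

To stay closer to the original code, the same wordLengths entry should be addable next to the existing tokenizer, as in control=list(tokenize=AlphabeticTokenizer, wordLengths=c(1,Inf)).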
Upvotes: 2