Reputation: 11
I am using the tm package in R to count the frequency of words containing an underscore in a vector.
My code looks like this:
library(tm)
a = c("happy_day", "great_book", "funny_movie")
myCorpus = Corpus(VectorSource(a))
myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
freq = sort(colSums(as.matrix(myDTM)), decreasing = TRUE)
I expect the tm package to count the three text strings as three words, but it actually treats each string as two words.
My expected content of freq is:
funny_movie great_book happy_day
1 1 1
However, what I actually get is:
book day funny great happy movie
1 1 1 1 1 1
I used similar code a few weeks ago, and at that time it gave me the expected result. Now I always get the unexpected result, even if I use
myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1, removePunctuation = FALSE))
Do you know what I can do to count words containing an underscore ("_") in my file?
Thanks a lot!
Upvotes: 1
Views: 137
Reputation: 206197
If you change
myCorpus = Corpus(VectorSource(a))
to
myCorpus = VCorpus(VectorSource(a))
you should get the result you want. The Corpus call defaults to returning a SimpleCorpus. When you run DocumentTermMatrix on that, it executes an efficient pipeline with a bunch of tasks that most(?) users want, basically ignoring the control= parameter. You can get around that by explicitly creating a VCorpus.
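For reference, here is a minimal sketch of the full pipeline with that one change applied. It reuses the vector from the question. The wordLengths entry is my assumption, not part of the original code: as far as I know it is the current name for the older minWordLength option, so adjust or drop it depending on your tm version.

```r
library(tm)

a <- c("happy_day", "great_book", "funny_movie")

# VCorpus instead of Corpus, so the control list is actually applied
myCorpus <- VCorpus(VectorSource(a))

myDTM <- DocumentTermMatrix(
  myCorpus,
  control = list(
    removePunctuation = FALSE,  # keep "_" inside each term
    wordLengths = c(1, Inf)     # assumed replacement for minWordLength = 1
  )
)

# Term frequencies across all documents, most frequent first
freq <- sort(colSums(as.matrix(myDTM)), decreasing = TRUE)
print(freq)
```

If the underscored terms still get split, check class(myCorpus) first: it should report VCorpus rather than SimpleCorpus.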
Upvotes: 2