How to avoid the removal of punctuation when using the tm package in R

Question

I am using the tm package in R to count the frequency of words containing an underscore in a character vector.

My code looks like this:

library(tm)
a = c("happy_day", "great_book", "funny_movie")
myCorpus = Corpus(VectorSource(a))
myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))
freq = sort(colSums(as.matrix(myDTM)), decreasing = TRUE)

I expect the tm package to count the three text strings as three words, but it actually treats each string as two words.

My expected content of freq is:

funny_movie  great_book   happy_day
          1           1           1

However, what I actually get is:

 book   day funny great happy movie
    1     1     1     1     1     1

I used similar code a few weeks ago, and at that time it gave me the expected results. Now I always get the unexpected result, even if I use myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1, removePunctuation = FALSE)).

Do you know what I can do to count words containing an underscore "_" in my file?

Thanks a lot!
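
For reference, here is a minimal base R sketch of the counting I am after. It assumes the split happens in the tokenizer, which seems to treat "_" as a word boundary, rather than in punctuation removal. The helper name whitespace_tokenizer is hypothetical and not part of tm. It splits on whitespace only, so the underscore stays inside each token.

```r
# Sample input taken from the question.
a <- c("happy_day", "great_book", "funny_movie")

# Hypothetical helper (not part of tm): split on whitespace only,
# so "_" is never treated as a word boundary.
whitespace_tokenizer <- function(x) {
  tokens <- unlist(strsplit(as.character(x), "[[:space:]]+"))
  tokens[nzchar(tokens)]  # drop empty strings caused by leading whitespace
}

# Count how often each token occurs, most frequent first.
tokens <- whitespace_tokenizer(a)
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```

A possible way to get the same result inside tm, untested here and dependent on the installed version, would be to build the corpus with VCorpus and pass a function like this as the tokenize entry of the control list. Recent tm versions also appear to use wordLengths = c(1, Inf) in place of minWordLength.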
