Si Yan

Reputation: 11

How to avoid the removal of punctuation when using R with the tm package

I am using the tm package in R to count the frequency of words containing an underscore in a vector.

My code looks like this:

library(tm)

a = c("happy_day", "great_book", "funny_movie")

myCorpus = Corpus(VectorSource(a))

myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1))

freq = sort(colSums(as.matrix(myDTM)), decreasing = TRUE)

I expect the tm package to count the three text strings as three words, but it actually treats each string as two words.

My expected content of freq is:

funny_movie  great_book   happy_day
          1           1           1

However, what I actually get is:

 book   day funny great happy movie
    1     1     1     1     1     1

I used similar code a few weeks ago, and at that time it gave me the expected results. But now I always get the unexpected result, even if I use myDTM = DocumentTermMatrix(myCorpus, control = list(minWordLength = 1, removePunctuation = FALSE)).

Do you know what I can do to count words containing an underscore "_" in my file?

Thanks a lot!

Upvotes: 1

Views: 137

Answers (1)

MrFlick

Reputation: 206197

If you change

myCorpus = Corpus(VectorSource(a))

to

myCorpus = VCorpus(VectorSource(a))

you should get the result you want. The Corpus call defaults to returning a SimpleCorpus. When you run DocumentTermMatrix on that, it executes an efficient pipeline with a bunch of tasks that most(?) users want, basically ignoring the control= parameter. You can get around that by explicitly creating a VCorpus.
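For reference, here is a minimal sketch of the whole pipeline with that change applied. It assumes a recent tm release in which Corpus() returns a SimpleCorpus for a VectorSource. The wordLengths = c(1, Inf) option is my substitution for minWordLength from the question, on the assumption that the intent was to keep short terms; it is not needed for the underscore behaviour itself.

```r
library(tm)

a <- c("happy_day", "great_book", "funny_movie")

# Build a VCorpus explicitly instead of relying on Corpus(),
# which may hand back a SimpleCorpus.
myCorpus <- VCorpus(VectorSource(a))

# Optional check of what kind of corpus was actually created.
print(class(myCorpus))

# wordLengths = c(1, Inf) keeps terms of any length (assumed intent of
# minWordLength = 1); removePunctuation = FALSE is spelled out for clarity.
myDTM <- DocumentTermMatrix(
  myCorpus,
  control = list(wordLengths = c(1, Inf), removePunctuation = FALSE)
)

# Total count of each term across all documents, most frequent first.
freq <- sort(colSums(as.matrix(myDTM)), decreasing = TRUE)
print(freq)
```

With a VCorpus the terms should come out whole, so each of the three underscore-joined strings is counted once.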

Upvotes: 2
