Fouad Selmane

Reputation: 408

DocumentTermMatrix in tm package does not return all words

I'm creating a document-term matrix with the tm-package in R, but some of the words in my corpus get lost in the process somewhere.

I will explain with an example. Let's say I have this small corpus:

library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))

When I use DocumentTermMatrix() from the tm-package, it returns these results:

dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
# Terms
# Docs and bout class home hours more next night
# 1   1    1     1    1     1    1    1     2

However, what I want (and expected) is:

# Docs and bout class home hours more next night my  go to
#  1   1    1     1    1     1    1    1     2   1   2  2

Why does DocumentTermMatrix() skip the words "my", "go" and "to"? Is there a way to control this behaviour?

Upvotes: 4

Views: 1453

Answers (1)

KenHBS

Reputation: 7164

By default, DocumentTermMatrix() discards words shorter than three characters. That is why the words to, my and go are left out when the document-term matrix is constructed.

The help page ?DocumentTermMatrix shows an optional argument called control. It carries default values for several options (see the help page ?termFreq for details). One of these defaults is a minimum word length of three characters, i.e. wordLengths = c(3, Inf). You can change this to include all words, regardless of length:

dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

inspect(dm)
# <<DocumentTermMatrix (documents: 1, terms: 11)>>
# Non-/sparse entries: 11/0
# Sparsity           : 0%
# Maximal term length: 5
# Weighting          : term frequency (tf)
#
#    Terms
# Docs and bout class go home hours more my next night to
#    1   1    1     1  2    1     1    1  1    1     2  2
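
As a rough cross-check of the counts above, the same sentence can be tokenised in base R without tm. This is only a sketch: splitting on whitespace is an assumption made here for illustration and is not how tm tokenises internally, and the variable names are made up for this example. It does show how a minimum length of three characters removes exactly the words my, go and to.

```r
# Same text as in the question
txt <- " more hours to my next class bout to go home and go night night"

# Naive whitespace tokenisation (an assumption, not tm's own tokenizer)
words <- strsplit(trimws(txt), "\\s+")[[1]]

# Counts for every word, regardless of length
all_counts <- table(words)

# Counts after keeping only words of at least three characters,
# which mimics the effect of wordLengths = c(3, Inf)
long_counts <- table(words[nchar(words) >= 3])

all_counts
long_counts
```

Comparing names(all_counts) with names(long_counts) shows which terms the length filter drops.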

Upvotes: 6
