Reputation: 408
I'm creating a document-term matrix with the tm package in R, but some of the words in my corpus get lost somewhere in the process.
I will explain with an example. Let's say I have this small corpus:
library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))
When I use DocumentTermMatrix() from the tm package, it returns these results:
dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
#     Terms
# Docs and bout class home hours more next night
#    1   1    1     1    1     1    1    1     2
However, what I want (and expected) is:
# Docs and bout class home hours more next night my go to
#    1   1    1     1    1     1    1    1     2  1  2  2
Why does DocumentTermMatrix() skip the words "my", "go" and "to"? Is there a way to control this behaviour?
Upvotes: 4
Views: 1453
Reputation: 7164
DocumentTermMatrix() automatically discards words that are shorter than three characters. Therefore, the words "to", "my" and "go" are not considered when constructing the document-term matrix.
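To make the effect of that length cut-off concrete, here is a rough base-R sketch that mimics it on the example sentence. It is only an illustration of the idea, not tm's actual implementation; the names tokens, min_len, kept and dropped are made up for this example.

```r
# Base-R illustration of a minimum word-length filter (not tm's internal code)
txt <- " more hours to my next class bout to go home and go night night"

# Split on whitespace after trimming the leading space
tokens <- strsplit(trimws(txt), "\\s+")[[1]]

# Count every distinct token
counts <- table(tokens)

# Hypothetical name mirroring the lower bound of wordLengths = c(3, Inf)
min_len <- 3

# Keep only tokens with at least min_len characters
kept <- counts[nchar(names(counts)) >= min_len]

# Tokens that fall below the cut-off
dropped <- setdiff(names(counts), names(kept))

print(kept)
print(dropped)
```

The dropped vector holds exactly the short words that never make it into the matrix under the default setting.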
From the help page ?DocumentTermMatrix, you can see there's an optional argument called control. This argument has default values for a number of settings (see the help page ?termFreq for more details). One of these defaults is a word length of at least three characters, i.e. wordLengths = c(3, Inf). You can change this to accommodate all words, regardless of length:
# Keep terms of any length (the default is wordLengths = c(3, Inf))
dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))
inspect(dm)
# <<DocumentTermMatrix (documents: 1, terms: 11)>>
# Non-/sparse entries: 11/0
# Sparsity : 0%
# Maximal term length: 5
# Weighting : term frequency (tf)
#
#     Terms
# Docs and bout class go home hours more my next night to
#    1   1    1     1  2    1     1    1  1    1     2  2
Upvotes: 6