Reputation: 44340
When I use the tm package for text mining, I often follow a workflow very similar to this:
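library(tm)
data(crude)
# Preprocess the corpus: lowercase, strip punctuation, drop stopwords, stem.
# tm >= 0.6 needs content_transformer() around plain functions such as tolower.
crude = tm_map(crude, content_transformer(tolower))
crude = tm_map(crude, removePunctuation)
crude = tm_map(crude, removeWords, stopwords("english"))
crude = tm_map(crude, stemDocument)
# Build the document-term matrix and drop sparse terms using ALL documents
dtm = DocumentTermMatrix(crude)
sparse = as.data.frame(as.matrix(removeSparseTerms(dtm, 0.8)))
# Random split of roughly 70% training / 30% testing
spl = runif(length(crude)) < 0.7
train = subset(sparse, spl)
test = subset(sparse, !spl)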
Basically, I preprocess the corpus, build a document-term matrix, remove sparse terms, and then split into a training and testing set.
While this is very easy with the tm package, something I don't like about it is that it implicitly uses both the training and the testing set to determine which terms are included (i.e. removeSparseTerms is called before I split into a training and testing set). While this might not be too bad with a random training/testing split, since we would expect word frequencies to be similar between the two sets, it could materially affect the results with a non-random split (e.g. when using sequential observations).
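To make the concern concrete, here is a minimal base R sketch on made-up counts (the matrix, the sequential split, and the helper keep_terms are all hypothetical, not part of tm). It assumes the rule removeSparseTerms applies is "keep a term if it appears in more than (1 - sparse) of the documents", which is my reading of its behavior.

```r
# Toy document-term counts: 6 documents (rows) x 3 terms (columns).
# The numbers are invented purely for illustration.
counts <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    1, 0, 1,
    0, 1, 0,
    0, 1, 0,
    0, 1, 3),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("oil", "opec", "price"))
)

# Sequential (non-random) split: first four documents train, last two test.
is_train <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Hypothetical helper: keep terms whose document frequency exceeds
# n_docs * (1 - sparse), the assumed removeSparseTerms() rule.
keep_terms <- function(m, sparse) {
  doc_freq <- colSums(m > 0)
  names(doc_freq)[doc_freq > nrow(m) * (1 - sparse)]
}

terms_all   <- keep_terms(counts, 0.6)              # selection sees train + test
terms_train <- keep_terms(counts[is_train, ], 0.6)  # selection sees train only

# Terms that survive only because the test documents were counted.
setdiff(terms_all, terms_train)
```

In this toy case "opec" is kept only when the test documents take part in the selection, which is the kind of leakage described above.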
I am wondering if anybody has a relatively simple way (with tm) to move the training/testing split earlier, remove sparse terms based on the word frequencies in the training set only, and then remove terms from the testing set so its columns match those of the training set.
Upvotes: 2
Views: 3331
Reputation: 6545
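library(tm)
library(Rstem)  # provides wordStem(); SnowballC also exports a wordStem()
data(crude)
set.seed(1)
# Split the corpus itself, before any term selection happens
spl <- runif(length(crude)) < 0.7
train <- crude[spl]
test <- crude[!spl]
# Preprocessing is passed as control options so both sets get the same treatment
controls <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords("english"),
  stemming = function(word) wordStem(word, language = "english")
)
# Sparse terms are chosen from the training documents only
train_dtm <- DocumentTermMatrix(train, controls)
train_dtm <- removeSparseTerms(train_dtm, 0.8)
# Restrict the test matrix to the training vocabulary via the dictionary option
test_dtm <- DocumentTermMatrix(
  test,
  c(controls, dictionary = list(dimnames(train_dtm)$Terms))
)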
## train_dtm
## A document-term matrix (13 documents, 91 terms)
##
## Non-/sparse entries: 405/778
## Sparsity : 66%
## Maximal term length: 9
## Weighting : term frequency (tf)
## test_dtm
## A document-term matrix (7 documents, 91 terms)
##
## Non-/sparse entries: 149/488
## Sparsity : 77%
## Maximal term length: 9
## Weighting : term frequency (tf)
## all(dimnames(train_dtm)$Terms == dimnames(test_dtm)$Terms)
## [1] TRUE
I had issues using the default stemmer. Also, there is a bounds option for controls, but I couldn't get the same results as removeSparseTerms when using it. I tried bounds = list(local = c(0.2 * length(train), Inf)) with floor and ceiling with no luck.
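One possible explanation, offered as a sketch rather than a confirmed fix: as far as the tm documentation describes it, bounds$local filters on how often a term occurs within a single document, whereas removeSparseTerms looks at how many documents contain the term, which is what bounds$global controls. The snippet below assumes that reading, and also assumes removeSparseTerms keeps terms found in strictly more than n_docs * (1 - sparse) documents while the global bound is inclusive. The name min_docs is mine, and stemming and the other controls are left out to keep it short.

```r
library(tm)
data(crude)

set.seed(1)
spl   <- runif(length(crude)) < 0.7
train <- crude[spl]

sparse <- 0.8
# Assumed rule: removeSparseTerms() keeps terms appearing in strictly more
# than n_docs * (1 - sparse) documents, so the smallest admissible document
# count is one above the floor of that product.
min_docs <- floor(length(train) * (1 - sparse)) + 1

# Route 1: build the full matrix, then drop sparse terms.
dtm_sparse <- removeSparseTerms(DocumentTermMatrix(train), sparse)

# Route 2: filter by document frequency while building the matrix.
dtm_bounds <- DocumentTermMatrix(
  train,
  control = list(bounds = list(global = c(min_docs, Inf)))
)

# If the assumptions hold, both routes keep the same vocabulary.
setequal(Terms(dtm_sparse), Terms(dtm_bounds))
```

If the last line returns FALSE on your version of tm, the threshold assumption above is the first thing to check.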
Upvotes: 3