removeSparseTerms with training and testing set

Question

When I use the tm package for text mining, I'll often follow a workflow very similar to this:

library(tm)
data(crude)
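# Recent tm versions need content_transformer() around base functions such as tolower
crude = tm_map(crude, content_transformer(tolower))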
crude = tm_map(crude, removePunctuation)
crude = tm_map(crude, removeWords, stopwords("english"))
crude = tm_map(crude, stemDocument)
dtm = DocumentTermMatrix(crude)
sparse = as.data.frame(as.matrix(removeSparseTerms(dtm, 0.8)))
spl = runif(length(crude)) < 0.7
train = subset(sparse, spl)
test = subset(sparse, !spl)

Basically, I preprocess the corpus, build a document-term matrix, remove sparse terms, and then split into a training and testing set.

While this is very easy with the tm package, something I don't like about it is that it implicitly uses both the training and the testing set to determine which terms are included (i.e. removeSparseTerms is called before I split into a training and testing set). This might not be too bad with a random training/testing split, since we would expect word frequencies to be similar between the two sets, but it could materially affect the results with a non-random split (e.g. when using sequential observations).

I am wondering if anybody has a relatively simple way (with tm) to move the training/testing split earlier, remove sparse terms based on the word frequencies in the training set only, and then remove terms from the testing set so its columns match those of the training set.

Jake Burkhead · Accepted Answer

library(tm)
library(Rstem)
data(crude)
set.seed(1)

spl <- runif(length(crude)) < 0.7
train <- crude[spl]
test <- crude[!spl]

controls <- list(
    tolower = TRUE,
    removePunctuation = TRUE,
    stopwords = stopwords("english"),
    stemming = function(word) wordStem(word, language = "english")
    )

train_dtm <- DocumentTermMatrix(train, controls)

train_dtm <- removeSparseTerms(train_dtm, 0.8)

test_dtm <- DocumentTermMatrix(
    test,
    c(controls, dictionary = list(dimnames(train_dtm)$Terms))
    )

## train_dtm
## A document-term matrix (13 documents, 91 terms)
##
## Non-/sparse entries: 405/778
## Sparsity           : 66%
## Maximal term length: 9
## Weighting          : term frequency (tf)

## test_dtm
## A document-term matrix (7 documents, 91 terms)
##
## Non-/sparse entries: 149/488
## Sparsity           : 77%
## Maximal term length: 9
## Weighting          : term frequency (tf)

## all(dimnames(train_dtm)$Terms == dimnames(test_dtm)$Terms)
## [1] TRUE
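
To get back to the data frames from the question, the two matrices can be converted after they are built. This is a minimal sketch rather than a tested recipe: it leaves out stemming so that it only needs tm, it uses Terms() in place of dimnames(...)$Terms, and train_df / test_df are names I made up for illustration.

```r
library(tm)
data(crude)
set.seed(1)

# Split the corpus first, so term selection only ever sees the training documents.
spl <- runif(length(crude)) < 0.7
train <- crude[spl]
test <- crude[!spl]

# Stemming is left out here so the sketch depends on tm alone.
controls <- list(
    tolower = TRUE,
    removePunctuation = TRUE,
    stopwords = stopwords("english")
    )

# Pick the vocabulary from the training documents only.
train_dtm <- removeSparseTerms(DocumentTermMatrix(train, controls), 0.8)

# Restrict the test matrix to the training vocabulary via the dictionary option.
test_dtm <- DocumentTermMatrix(
    test,
    c(controls, list(dictionary = Terms(train_dtm)))
    )

# Index the test columns by term name so both frames share one column order.
train_df <- as.data.frame(as.matrix(train_dtm))
test_df <- as.data.frame(as.matrix(test_dtm)[, Terms(train_dtm), drop = FALSE])
```

Indexing the test matrix by the training terms is a belt-and-braces step. The columns should already line up, but the reorder makes that explicit before the frames go into a model.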

I had issues using the default stemmer. There is also a bounds option for the control list, but I couldn't reproduce the results of removeSparseTerms with it. I tried bounds = list(local = c(0.2 * length(train), Inf)), with both floor and ceiling applied to the lower bound, with no luck.
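
One possible explanation for the bounds mismatch, as far as I can tell from the tm documentation: local bounds filter on how often a term occurs within a single document, while global bounds filter on how many documents contain the term, which is the quantity removeSparseTerms works with. The sketch below assumes removeSparseTerms keeps terms found in strictly more than N * (1 - sparse) documents and that the global bounds are inclusive. min_docs is a name introduced here for illustration, and stemming is left out to keep the example small.

```r
library(tm)
data(crude)
set.seed(1)

spl <- runif(length(crude)) < 0.7
train <- crude[spl]

sparse <- 0.8

# Assumption: removeSparseTerms keeps terms present in strictly more than
# N * (1 - sparse) documents, so the smallest admissible count is floor(...) + 1.
min_docs <- floor(length(train) * (1 - sparse)) + 1

base_controls <- list(
    tolower = TRUE,
    removePunctuation = TRUE,
    stopwords = stopwords("english")
    )

# Route 1: build the full matrix, then drop sparse terms.
dtm_sparse <- removeSparseTerms(DocumentTermMatrix(train, base_controls), sparse)

# Route 2: filter by document frequency while the matrix is built.
dtm_bounds <- DocumentTermMatrix(
    train,
    c(base_controls, list(bounds = list(global = c(min_docs, Inf))))
    )

# If the assumptions above hold, both routes select the same vocabulary.
same_terms <- setequal(Terms(dtm_sparse), Terms(dtm_bounds))
```

If same_terms comes back FALSE on your tm version, the threshold rule assumed above is the first thing to check.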
