TMOTTM

Reputation: 3391

Why are stopwords not filtered out in `tm` corporized term-document matrices?

I'm building a term-document matrix using the tm library.

# Create corpus.
corporize <- function(dir_to_corporize)
{
    crp <- Corpus(DirSource(dir_to_corporize, mode="text", encoding="ASCII"),
                 readerControl=list(reader=readPlain, language="en_EN"))
    crp <- tm_map(crp, removeWords, stopwords("english"))
    crp <- tm_map(crp, removePunctuation, preserve_intra_word_dashes=F)
    crp <- tm_map(crp, removeNumbers)
    crp <- tm_map(crp, stripWhitespace)
    crp <- tm_map(crp, content_transformer(tolower))
}

However, when I check my term-document matrix, I find that a couple of stopwords remain:

the last time i saw
we need talk about kevin
you make me feel like

Why is that and what can I do?

Upvotes: 0

Views: 72

Answers (1)

phiver

Reputation: 23608

The commands in your function are in the wrong order. If you print the stopword list with `stopwords()`, you will see that every stopword is in lowercase. `removeWords` matches case-sensitively, so you should transform everything to lowercase first and only then remove the stopwords. Otherwise you keep words like "I" and capitalized stopwords at the beginning of a sentence, such as "The".
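A minimal sketch of the reordered pipeline, in case it helps. It assumes the `tm` package is installed; the `clean_corpus` helper and the two sample titles are made up for illustration and stand in for the files read through `DirSource` in the question.

```r
library(tm)

# Hypothetical sample documents standing in for the files in the directory.
titles <- c("The Last Time I Saw", "We Need to Talk About Kevin")

# Hypothetical helper: the same steps as in the question, reordered.
clean_corpus <- function(crp) {
    # Lowercase first, so capitalized stopwords ("The", "I", "About")
    # match the all-lowercase entries of stopwords("english").
    crp <- tm_map(crp, content_transformer(tolower))
    crp <- tm_map(crp, removeWords, stopwords("english"))
    crp <- tm_map(crp, removePunctuation, preserve_intra_word_dashes = FALSE)
    crp <- tm_map(crp, removeNumbers)
    # Collapse the gaps left behind by the removed words.
    crp <- tm_map(crp, stripWhitespace)
    crp
}

crp <- clean_corpus(VCorpus(VectorSource(titles)))

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unname(sapply(crp, as.character)))
print(cleaned)
```

The only change that matters is moving `content_transformer(tolower)` ahead of `removeWords`; the remaining steps keep the order from the question.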

Upvotes: 1
