Reputation: 3391
I'm building a term-document matrix using the tm package.
# Create corpus.
corporize <- function(dir_to_corporize)
{
crp <- Corpus(DirSource(dir_to_corporize, mode="text", encoding="ASCII"),
readerControl=list(reader=readPlain, language="en_EN"))
crp <- tm_map(crp, removeWords, stopwords("english"))
crp <- tm_map(crp, removePunctuation, preserve_intra_word_dashes=F)
crp <- tm_map(crp, removeNumbers)
crp <- tm_map(crp, stripWhitespace)
crp <- tm_map(crp, content_transformer(tolower))
}
However, when I check my term-document matrix, I find that a few stopwords remain:
the last time i saw
we need talk about kevin
you make me feel like
Why is that and what can I do?
Upvotes: 0
Views: 72
Reputation: 23608
The commands in your function are in the wrong order. If you look at the list of stopwords returned by stopwords(), you will see that they are all in lower case, and removeWords matches case-sensitively.
You should therefore convert everything to lower case before removing the stopwords. Otherwise you will keep words like "I", as well as any capitalized stopword at the beginning of a sentence such as "The" or "We".
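As a rough sketch of the reordering, assuming the same tm cleaning steps as in the question: the helper name clean_corpus and the three sample strings are made up for illustration, and VectorSource is used in place of DirSource only so the snippet runs without any files on disk.

```r
library(tm)

# Hypothetical helper: same steps as in the question, but with
# lowercasing moved ahead of stopword removal.
clean_corpus <- function(crp) {
  crp <- tm_map(crp, content_transformer(tolower))       # lowercase first
  crp <- tm_map(crp, removeWords, stopwords("english"))  # "The", "I", "We" now match
  crp <- tm_map(crp, removePunctuation, preserve_intra_word_dashes = FALSE)
  crp <- tm_map(crp, removeNumbers)
  crp <- tm_map(crp, stripWhitespace)
  crp  # return the cleaned corpus explicitly
}

# Made-up sample texts standing in for the files read via DirSource()
docs <- c("The last time I saw",
          "We need to talk about Kevin",
          "You make me feel like")

crp <- clean_corpus(VCorpus(VectorSource(docs)))
cleaned <- unname(sapply(crp, as.character))
print(cleaned)
```

In your own function, the same effect should follow from moving the tolower line to the top, directly after the Corpus(...) call.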
Upvotes: 1