Reputation: 99
How can I force removeWords from library(tm) to take each word in a stop word list verbatim (literally), not as a regex?
Suppose I have a file stopwordlist.txt whose entries contain characters that can be misinterpreted as regular expression syntax:
e.g.
"
.net
...
\
***p<
This is my code:
library(tm)
...
custom_stopwords <- read.delim2("stopwordlist.txt", header = FALSE, sep = "\n", quote = "", fill = TRUE, comment.char = "")
...
corpus = tm_map(corpus, removeWords, custom_stopwords$V1)
I would expect removeWords to treat each line as a verbatim stop word. For example, it should remove every occurrence of "e.g." but leave the word "ergo" alone, which also matches when "e.g." is read as a regexp (each "." matches any character). In addition, some of the special characters make the call fail with an error saying the pattern is not a valid regexp.
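Both symptoms can be reproduced in base R without tm. This is only a minimal sketch: the sample strings are made up for illustration, and perl = TRUE is used on the assumption that removeWords builds a Perl-style pattern.

```r
# "e.g." read as a regex: each "." matches any character, so "ergo" matches.
as_regex <- grepl("e.g.", "ergo", perl = TRUE)

# The same entry read literally does not match "ergo".
as_literal <- grepl("e.g.", "ergo", fixed = TRUE)

# An entry that starts with "*" is not a valid Perl-style pattern,
# so the call raises an error instead of returning a result.
bad <- tryCatch(
  suppressWarnings(grepl("***p<", "p<0.05", perl = TRUE)),
  error = function(e) "invalid regex"
)

print(as_regex)    # TRUE
print(as_literal)  # FALSE
print(bad)         # "invalid regex"
```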
Upvotes: 0
Views: 85
Reputation: 448
Maybe try creating an escaped copy of the stop list, used only with removeWords, in which the special characters are preceded by a backslash? That way you at least don't need to manually change every '.' to '\.' in the file.
# Escape '.', '*' and '"' with a backslash; other metacharacters (e.g. '\') are not covered
escaped_stopwords <- gsub("(\\.|\\*|\")", "\\\\\\1", custom_stopwords$V1, perl = TRUE)
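The same idea can be extended to every metacharacter, including the lone backslash from the sample list. The sketch below is an illustration, not part of tm: escape_regex and remove_verbatim are hypothetical helper names, and the sample sentence is invented. One caveat, as far as I know: removeWords wraps the joined words in \b word boundaries, so entries that begin or end with a non-word character (such as .net or ...) may still fail to match even after escaping. For that reason a second helper is shown that skips regexes altogether by using fixed = TRUE.

```r
# Hypothetical helper: backslash-escape every Perl-regex metacharacter in x.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", as.character(x), perl = TRUE)
}

# Hypothetical helper: delete each entry of `words` from x as a plain substring.
# No word boundaries are applied, so a short entry also matches inside longer words.
remove_verbatim <- function(x, words) {
  words <- as.character(words)   # the column may be a factor in older R versions
  words <- words[nzchar(words)]  # drop empty lines
  # Longest entries first, so a short entry cannot break up a longer one.
  words <- words[order(nchar(words), decreasing = TRUE)]
  for (w in words) x <- gsub(w, "", x, fixed = TRUE)
  x
}

# The sample entries from the question, written as R string literals.
stop_list <- c("e.g.", "\"", ".net", "...", "\\", "***p<")

escaped <- escape_regex(stop_list)
cleaned <- remove_verbatim("ergo, e.g. the .net runtime", stop_list)

print(escaped)
print(cleaned)
```

With tm, the literal version could presumably be applied as tm_map(corpus, content_transformer(remove_verbatim), custom_stopwords$V1). I have not run this against the original corpus.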
Upvotes: 1