kabauter

Reputation: 99

R: removeWords in tm treats stop word file entries as regex, not verbatim

How can I force removeWords from library(tm) to take each word in a stop word list verbatim (literally), not as a regex?

Suppose I have a file stopwordlist.txt containing characters that can be misinterpreted as regular expressions:

 e.g.
 "
 .net
 ...
 \
 ***p<

This is my code:

library(tm)
...
custom_stopwords <- read.delim2("stopwordlist.txt", header = FALSE, sep = "\n", quote = "", fill = TRUE, comment.char = "")
...
corpus = tm_map(corpus, removeWords, custom_stopwords$V1)

I would expect removeWords to treat each line as a verbatim stop word. For example, it should remove each occurrence of "e.g." but not the word "ergo", which also matches when "e.g." is interpreted as a regex. On top of that, some of the special characters make the call fail with an error saying the pattern is not a valid regular expression.
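
To illustrate the mismatch described above, here is a minimal base R sketch. It does not use tm and only contrasts regex matching with literal matching; the variable names are made up for the illustration.

```r
# Base R only: shows why a stop word like "e.g." is risky as a regex.
pattern <- "e.g."

# As a regex, each "." matches any single character, so "ergo" matches.
as_regex <- grepl(pattern, "ergo")

# As a fixed (literal) string, only the exact characters match.
as_literal <- grepl(pattern, "ergo", fixed = TRUE)
literal_hit <- grepl(pattern, "see e.g. this", fixed = TRUE)

print(c(as_regex = as_regex, as_literal = as_literal, literal_hit = literal_hit))
```

The first result is TRUE and the second is FALSE, which is exactly the "ergo" problem; the third is TRUE because the literal text is present.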

Upvotes: 0

Views: 85

Answers (1)

Hayden Y.

Reputation: 448

Maybe try creating an alternate version of the stop list, used only with removeWords, in which the special characters are escaped? That way you at least don't need to manually change every '.' to '\.'

# Escapes '.', '*' and '"' only; other metacharacters such as '\' are left as-is
escaped_stopwords <- gsub("(\\.|\\*|\")", "\\\\\\1", custom_stopwords$V1, perl = TRUE)
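
The gsub call above covers only a few characters, so a list containing a backslash or brackets would still break. Below is a rough sketch of two alternatives. Both escape_regex and remove_literal are hypothetical helpers written for this answer, not functions from tm. Also note that removeWords, as far as I know, wraps the words in \b word boundaries, so entries made only of punctuation (like "...") may still not be removed even after escaping. The second helper sidesteps regex entirely with fixed = TRUE, at the cost of removing the text wherever it occurs, including inside longer words.

```r
# Hypothetical helper: backslash-escape the usual regex metacharacters.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: remove each stop word as a literal string.
# No word boundaries are applied, so ".net" is also removed inside "asp.net".
remove_literal <- function(text, words) {
  words <- words[nzchar(words)]                            # skip empty lines
  words <- words[order(nchar(words), decreasing = TRUE)]   # longest first
  for (w in words) {
    text <- gsub(w, "", text, fixed = TRUE)
  }
  text
}

stop_words <- c("e.g.", "\"", ".net", "...", "\\", "***p<")

escaped <- escape_regex(stop_words)
cleaned <- remove_literal("ergo, e.g. the .net docs ... ***p<0.05", stop_words)
print(escaped)
print(cleaned)

# With tm, the literal version could presumably be applied like this
# (assumes a corpus object and the custom_stopwords data frame from the question):
# corpus <- tm_map(corpus, content_transformer(remove_literal),
#                  words = as.character(custom_stopwords$V1))
```

Literal removal leaves the surrounding whitespace behind, so a stripWhitespace step afterwards is probably still wanted.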

Upvotes: 1
