Reputation: 25
It's common to remove stopwords from a text or character vector. I use the function removeWords from the tm package.
However, I'm trying to remove all the words except for stopwords. I have a list of words I made called x. When I use
removeWords(text, x)
I get this error:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), PCRE pattern compilation error 'regular expression is too large'
I've also tried using grep:
grep(x, text)
But that won't work, because x is a vector and not a single character string.
So, how can I remove all the words that aren't in that vector? Or alternatively, how can I select only the words in the vector?
Upvotes: 1
Views: 2130
Reputation: 6778
If you want x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which will allow you to look for those words in text. But keep in mind that the regex might still be too large. If you want to remove any word that is not in stopwords(), you can create your own function:
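library(tm)  # provides stopwords()

keep_stopwords <- function(text) {
  # Build one alternation with a word boundary around each stop word
  stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
  stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
  # Split on spaces and test each token separately
  tmp <- strsplit(text, " ")[[1]]
  idx <- grepl(stop_regex, tmp)
  # Reassemble the matching tokens into a single string
  txt <- paste(tmp[idx], collapse = " ")
  return(txt)
}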
text = "How much wood would a woodchuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"
Basically, we just set up the stopwords() as a regex that will look for any of those words. But we have to be careful about partial matches, so we wrap each stop word in \\b to ensure it's a full match. Then we split the string so that we match each word individually and create an index of the words that are stop words. Finally, we paste those words together again and return them as a single string. Note that the match is case-sensitive, which is why "How" and "More" are dropped in the output above.
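To see why the \\b anchors matter, here is a small base-R sketch. The vector keep is a made-up stand-in for stopwords(), so this runs without tm; the token list is also invented for illustration.

```r
# Hypothetical stand-in for stopwords(); any character vector works.
keep <- c("a", "if", "would")
tokens <- c("would", "creatures", "a", "than", "if")

# Without boundaries, "a" also matches inside "creatures" and "than".
loose <- paste(keep, collapse = "|")
loose_hits <- grepl(loose, tokens)

# With \\b around the alternation, only whole words match.
strict <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, tokens)

tokens[loose_hits]
tokens[strict_hits]
```

The first subset returns every token, while the second returns only "would", "a" and "if".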
Here's another approach, which is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.
keep_words <- function(text, keep) {
  # Split on spaces, then keep only the tokens that appear in `keep`
  words <- strsplit(text, " ")[[1]]
  txt <- paste(words[words %in% keep], collapse = " ")
  return(txt)
}
x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"
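Both functions compare raw tokens, so "How" and "wood?" never match a lower-case, punctuation-free word list. If that matters for your own vector x, one possible variation is sketched below. The name keep_words_clean and the short keep vector are hypothetical and only there for illustration; swap in stopwords() or your own list.

```r
# Sketch only: keep_words_clean is a hypothetical helper, not part of tm.
keep_words_clean <- function(text, keep) {
  # Split on runs of whitespace rather than a single space.
  words <- strsplit(text, "[[:space:]]+")[[1]]
  # Compare on a lower-cased copy with leading/trailing punctuation stripped.
  key <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", words))
  # Return the original tokens whose normalized form is in `keep`.
  paste(words[key %in% keep], collapse = " ")
}

keep <- c("how", "if", "a", "but")  # made-up stand-in for stopwords()
keep_words_clean("How much wood, if a woodchuck could? But...", keep)
# [1] "How if a But..."
```

The original tokens are returned unchanged, so punctuation and capitalization survive in the output; return key[key %in% keep] instead if you would rather have the normalized words.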
Upvotes: 3