I am using the removeWords and tm_map() functions in the tm package in order to parse some text data. My understanding is that it simply uses Perl regular expressions through gsub() to complete the task.
However, when I run my code, I get a strange error. I am using R 3.3.2.
docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)
And I get...
Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
invalid regular expression '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be|at|as|aren't|are|any|and|an|am|all|against|again|after|above|about|a
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'this version of PCRE is not compiled with Unicode property support'
at '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be| [... truncated]
As I understand it, the important part is "this version of PCRE is not compiled with Unicode property support." Any ideas on how to address this? I ran pcre_config() in R and got the following:
             UTF-8 Unicode properties                JIT
              TRUE              FALSE              FALSE
And outside of R, I ran pcretest -C and got the following:
PCRE version 7.8 2008-09-05
Compiled with
UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
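One possible explanation for the mismatch between the two reports is that R is linked against a different PCRE build than the pcretest binary found on the PATH. This is an assumption, not something confirmed here. A minimal sketch for checking, from inside R, which PCRE the session uses and whether the (*UCP) prefix compiles (the variable names are only illustrative):

```r
# Version string of the PCRE library this R session is linked against.
pcre_version <- extSoftVersion()[["PCRE"]]

# Named logical vector; the "Unicode properties" entry is the relevant one.
cfg <- pcre_config()
has_ucp <- isTRUE(cfg[["Unicode properties"]])

# Try a tiny pattern that uses the same (*UCP) prefix as removeWords.
# If PCRE lacks Unicode property support, gsub() raises an error here.
ucp_works <- tryCatch(
  {
    suppressWarnings(gsub("(*UCP)\\bfoo\\b", "", "foo bar", perl = TRUE))
    TRUE
  },
  error = function(e) FALSE
)

cat("PCRE used by R:           ", pcre_version, "\n")
cat("Unicode properties in R:  ", has_ucp, "\n")
cat("(*UCP) pattern compiles:  ", ucp_works, "\n")
```

If the version printed here differs from the one reported by pcretest -C, the two tools are not looking at the same library.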
Any feedback would be greatly appreciated.
RickyB
I faced the same problem while trying to create a word cloud tool. For some reason, removing stop words with removeWords and the "stopwords" function does not work properly.
I found a solution here: Manual removal of stopwords
Here is my code after making a few changes to the code in the above link. This is the call that fails:
docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)
Manually removing stopwords:
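library(tm)  # provides stopwords()
# Download the SMART stop list (kept from the linked code; r, stopWords and vstop are not used below)
r <- read.table(url("http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/"), fill=TRUE)
stopWords <- r
vstop <- as.vector(stopWords)
# Use the SMART stop list that ships with tm
stpWrd <- stopwords("SMART")
# text must already be split into individual words, one word per element
text <- unlist(text)[!(unlist(text) %in% stpWrd)]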
I hope it helps.
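Another option, if the text has not been split into words yet, is to do the same kind of gsub() replacement that the error message shows, but without the (*UCP) prefix that needs Unicode property support. Below is a minimal sketch in base R; removeWordsNoUCP and the short stop list are hypothetical names made up for this example, and without (*UCP) the \b word boundary is ASCII-only, so non-ASCII words may not be matched the same way.

```r
# Hypothetical helper: like the pattern in the error message, minus (*UCP).
removeWordsNoUCP <- function(x, words) {
  # Longest-first ordering so "yourself" is tried before "your".
  alternatives <- paste(sort(words, decreasing = TRUE), collapse = "|")
  pattern <- sprintf("\\b(%s)\\b", alternatives)
  gsub(pattern, "", x, perl = TRUE)
}

# Small hand-written stop list, for illustration only.
example_stops <- c("this", "is", "a", "of", "the")
raw <- "this is a test of the stop word filter"

# Remove the stop words, then collapse the leftover whitespace.
stripped <- removeWordsNoUCP(raw, example_stops)
cleaned <- trimws(gsub("\\s+", " ", stripped))
print(cleaned)
```

With a tm corpus, something like tm_map(docs, content_transformer(removeWordsNoUCP), stopwords("english")) should apply the same helper to each document, though I have not verified that on the setup from the question.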