RickyB

Reputation: 617

R error caused by PCRE configuration (Unicode properties)

I am using the removeWords and tm_map() functions in the tm package in order to parse some text data. My understanding is that it simply uses Perl regular expressions through gsub() to complete the task.

However, when I run my code, I get a strange error. I am using R 3.3.2.

docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)

And I get...

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  : 
  invalid regular expression '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be|at|as|aren't|are|any|and|an|am|all|against|again|after|above|about|a
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'this version of PCRE is not compiled with Unicode property support'
    at '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be| [... truncated]

As I understand it, the important part is "this version of PCRE is not compiled with Unicode property support." Any ideas on how to address this? I ran pcre_config() in R and got the following:

     UTF-8 Unicode properties                JIT 
      TRUE              FALSE              FALSE 

And outside of R, I ran pcretest -C and got the following:

PCRE version 7.8 2008-09-05
Compiled with
  UTF-8 support
  Unicode properties support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
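
Note that pcretest reports on the system PCRE library, which is not necessarily the copy R was built against, so the two outputs above can legitimately disagree. A minimal sketch for checking this from inside R follows. It uses only base R functions (extSoftVersion() and pcre_config(), both of which should be available in R 3.3.2); the variable names pcre_version, has_ucp and ucp_works are made up for illustration.

```r
# Version string of the PCRE library this R session is linked against.
pcre_version <- extSoftVersion()[["PCRE"]]
cat("PCRE used by R:", pcre_version, "\n")

# pcre_config() returns a named logical vector. "Unicode properties" has to be
# TRUE for patterns that start with (*UCP) to compile.
cfg <- pcre_config()
has_ucp <- isTRUE(unname(cfg["Unicode properties"]))
cat("Unicode property support:", has_ucp, "\n")

# Try a tiny (*UCP) pattern directly. If PCRE rejects it, gsub/grepl raise an
# error, which tryCatch converts to FALSE here.
ucp_works <- tryCatch(
  suppressWarnings(grepl("(*UCP)\\bword\\b", "a word here", perl = TRUE)),
  error = function(e) FALSE
)
cat("(*UCP) pattern compiles and matches:", ucp_works, "\n")
```

If the version printed here differs from the 7.8 reported by pcretest, R is presumably using a different PCRE build than the one installed on the system.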

Any feedback would be greatly appreciated.

Upvotes: 1

Views: 795

Answers (1)

Nina Omani

Reputation: 31

RickyB

I faced the same problem while trying to create a word cloud tool. For some reason, removing words with the "stopwords" function does not work properly.

I found a solution here: Manual removal of stopwords

Here is my code after making a few changes to the code from the link above. Instead of this call, which fails:

docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)

Manually removing stopwords:

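library(tm)
# Reading the list from
# http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/
# is unnecessary: the result was never used, and tm provides a SMART list directly.
stpWrd <- stopwords("SMART")
# text is expected to hold individual words (one token per element).
# %in% compares whole strings, so no regular expression is compiled.
text <- unlist(text)
text <- text[!(text %in% stpWrd)]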
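
The snippet above assumes text already contains one word per element. As a self-contained illustration of the same idea, here is a minimal sketch in base R that splits each document into tokens and filters them with %in%. The helper remove_stop_words, the short stop_words vector and the sample documents are made up for the example; in practice the list could come from stopwords("SMART").

```r
# Hypothetical short stop word list, standing in for stopwords("SMART").
stop_words <- c("the", "is", "a", "of", "and")

# Hypothetical helper: lower-case each document, split it on whitespace,
# drop tokens found in the stop list, and glue the rest back together.
remove_stop_words <- function(docs, stop_words) {
  vapply(docs, function(doc) {
    tokens <- strsplit(tolower(doc), "[[:space:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]          # discard empty strings
    kept <- tokens[!(tokens %in% stop_words)] # exact, whole-word comparison
    paste(kept, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

docs_raw <- c("The cat is on a mat", "Kings and queens of the north")
cleaned <- remove_stop_words(docs_raw, stop_words)
print(cleaned)
```

Because the stop words are compared as whole strings, the (*UCP) pattern is never built. The trade-off is that punctuation attached to a word (for example "mat.") keeps it from matching, so punctuation would need to be stripped first.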

I hope it helps.
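
Another option that may be worth trying keeps the regular expression approach but leaves out the (*UCP) prefix that appears in the error message. This is only a sketch: remove_words_no_ucp is a made-up name, it is not part of tm, and it assumes the stop words contain no regex metacharacters (letters and apostrophes only).

```r
# Hypothetical replacement for the failing pattern: same alternation and word
# boundaries as in the error message, but without the (*UCP) prefix.
remove_words_no_ucp <- function(x, words) {
  # Reverse alphabetical order puts longer words before their prefixes,
  # e.g. "your" is tried before "you".
  alternatives <- paste(sort(words, decreasing = TRUE), collapse = "|")
  pattern <- sprintf("\\b(%s)\\b", alternatives)
  gsub(pattern, "", x, perl = TRUE)
}

sample_words <- c("you", "your", "and", "you've")
out <- remove_words_no_ucp("you and your cat", sample_words)
print(trimws(out))
```

With tm loaded, something like tm_map(docs, content_transformer(remove_words_no_ucp), stopwords("english")) might then stand in for the removeWords call, though I have not tested that on the setup from the question. Without (*UCP), \b follows ASCII word rules, so words next to accented letters may be handled differently.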

Upvotes: 1
