Reputation: 1293
I want to use the R package 'tm' to do some text mining, and I want to add some special characters to my stopwords.
stop3<-c("()","(3):","article","..","etal.","fig.","natgenet","artical","articleinitiallypublished")
reuters <- tm_map(reuters, removeWords, c(stopwords("english"),stop3))
dtm <- DocumentTermMatrix(reuters)
findFreqTerms(dtm, 20)
However, I found that `()`, `etal.`, and `():` cannot be removed from reuters. Does anyone know what happened?
Thanks
This is what is returned when I use findFreqTerms:
findFreqTerms(dtm, 20)
[1] "()." "():" "etal." "found" "htmlpdfversions" "show"
Upvotes: 0
Views: 375
Reputation: 14902
You could use quanteda, which is not bothered by the special characters (the `(` and `)` characters) in the stopword removal patterns.
Tokenizing with what = "fasterword" means you are splitting on whitespace only, rather than using stringi rules to separate out the punctuation characters (which is what happens by default).
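To see why the default matters, here is a sketch contrasting the two tokenizers on the same text (the exact printed token output can vary slightly across quanteda versions, so no output is shown):

```r
library("quanteda")

# The default tokenizer separates punctuation into its own tokens,
# so a string like "()" never survives as a single token that
# tokens_remove() could match against the stopword list:
toks_default <- tokens("this () etal. is in artical fig. two")
as.list(toks_default)

# Splitting only on whitespace keeps "()" and "etal." intact
# as removable tokens:
toks_fast <- tokens("this () etal. is in artical fig. two",
                    what = "fasterword")
as.list(toks_fast)
```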
stop3 <- c(
"()", "(3):", "article", "..", "etal.", "fig.", "natgenet",
"artical", "articleinitiallypublished"
)
library("quanteda")
## Package version: 1.4.3
# import the tm corpus as a quanteda corpus
data(crude, package = "tm")
reuters <- corpus(crude)
# example of removing tokens
(toks <- tokens("this () etal. is in artical fig. two", what = "fasterword"))
## tokens from 1 document.
## text1 :
## [1] "this" "()" "etal." "is" "in" "artical" "fig."
## [8] "two"
tokens_remove(toks, stop3)
## tokens from 1 document.
## text1 :
## [1] "this" "is" "in" "two"
# in this problem
dtm <- tokens(reuters, what = "fasterword") %>%
tokens_remove(c(stopwords("en"), stop3)) %>%
dfm()
topfeatures(dtm, 20)
## oil said opec prices mln last crude reuter
## 80 52 38 33 29 24 20 20
## dlrs saudi said. bpd one new official kuwait
## 19 18 17 16 15 14 14 13
## price market pct sheikh
## 12 12 12 11
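If you would rather stay within tm, the underlying problem is that removeWords wraps each pattern in `\b...\b` word boundaries, which never match around punctuation-only "words" such as `()` or `():`. A possible workaround is a content_transformer that applies fixed-string gsub instead (a sketch; `removeLiteral` is a hypothetical helper name, and only a subset of the patterns is shown):

```r
library("tm")

# removeWords() builds a regex with \b word boundaries, so
# punctuation-only patterns like "()" are silently never matched.
# A fixed-string gsub strips them literally instead.
removeLiteral <- content_transformer(function(x, patterns) {
  for (p in patterns) x <- gsub(p, "", x, fixed = TRUE)
  x
})

data(crude)  # example corpus shipped with tm
reuters <- crude
reuters <- tm_map(reuters, removeLiteral,
                  c("()", "(3):", "etal.", "fig."))
```

After this step the usual removeWords call can still handle the ordinary word stopwords.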
Upvotes: 1