Reputation: 1293
I want to use the R package 'tm' to do some text mining, and I want to add some special characters to my stopwords.
stop3<-c("()","(3):","article","..","etal.","fig.","natgenet","artical","articleinitiallypublished")
reuters <- tm_map(reuters, removeWords, c(stopwords("english"),stop3))
dtm <- DocumentTermMatrix(reuters)
findFreqTerms(dtm, 20)
However, I found that `()`, `etal.`, and `():` cannot be removed from reuters. Does anyone know what happened?
Thanks
This is what is returned when I use findFreqTerms:
findFreqTerms(dtm, 20)
[1] "()." "():" "etal." "found" "htmlpdfversions" "show"
Upvotes: 0
Views: 375
Reputation: 14902
You could use quanteda, which is not bothered by the special characters (the `(` and `)` characters) in the stopword removal patterns.
Tokenizing with what = "fasterword" means you are splitting on whitespace only, rather than using stringi rules to separate out the punctuation characters (which is what happens by default).
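To see why the default matters, here is a sketch contrasting the two tokenizers on the same text (the exact printed token output can vary slightly across quanteda versions, so no output is shown):

```r
library("quanteda")

# The default tokenizer separates punctuation into its own tokens,
# so a string like "()" never survives as a single token that
# tokens_remove() could match against the stopword list:
toks_default <- tokens("this () etal. is in artical fig. two")
as.list(toks_default)

# Splitting only on whitespace keeps "()" and "etal." intact
# as removable tokens:
toks_fast <- tokens("this () etal. is in artical fig. two",
                    what = "fasterword")
as.list(toks_fast)
```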
stop3 <- c(
"()", "(3):", "article", "..", "etal.", "fig.", "natgenet",
"artical", "articleinitiallypublished"
)
library("quanteda")
## Package version: 1.4.3
# import the tm corpus as a quanteda corpus
data(crude, package = "tm")
reuters <- corpus(crude)
# example of removing tokens
(toks <- tokens("this () etal. is in artical fig. two", what = "fasterword"))
## tokens from 1 document.
## text1 :
## [1] "this" "()" "etal." "is" "in" "artical" "fig."
## [8] "two"
tokens_remove(toks, stop3)
## tokens from 1 document.
## text1 :
## [1] "this" "is" "in" "two"
# in this problem
dtm <- tokens(reuters, what = "fasterword") %>%
tokens_remove(c(stopwords("en"), stop3)) %>%
dfm()
topfeatures(dtm, 20)
## oil said opec prices mln last crude reuter
## 80 52 38 33 29 24 20 20
## dlrs saudi said. bpd one new official kuwait
## 19 18 17 16 15 14 14 13
## price market pct sheikh
## 12 12 12 11
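If you would rather stay within tm, the underlying problem is that removeWords wraps each pattern in `\b...\b` word boundaries, which never match around punctuation-only "words" such as `()` or `():`. A possible workaround is a content_transformer that applies fixed-string gsub instead (a sketch; `removeLiteral` is a hypothetical helper name, and only a subset of the patterns is shown):

```r
library("tm")

# removeWords() builds a regex with \b word boundaries, so
# punctuation-only patterns like "()" are silently never matched.
# A fixed-string gsub strips them literally instead.
removeLiteral <- content_transformer(function(x, patterns) {
  for (p in patterns) x <- gsub(p, "", x, fixed = TRUE)
  x
})

data(crude)  # example corpus shipped with tm
reuters <- crude
reuters <- tm_map(reuters, removeLiteral,
                  c("()", "(3):", "etal.", "fig."))
```

After this step the usual removeWords call can still handle the ordinary word stopwords.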
Upvotes: 1