Timon Boehm

Reputation: 21

Customized word stemming in R quanteda

Does anyone know how to treat a group of features of a dtm as a single feature? The standard stemming options, such as tokens_wordstem or dfm_wordstem, don't do a good job in my case, so I want to define customized features by hand, for instance “eat” for “eat”, “eater”, and “ate”.

Upvotes: 2

Views: 276

Answers (1)

Ken Benoit

Reputation: 14902

You could do this with a dictionary, provided you can write your own rules as dictionary patterns:

library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

toks <- tokens("The eater ate when I eat and she eats.")

dict <- dictionary(list(eat = c("eat*", "ate")))

# original tokens
toks
## Tokens consisting of 1 document.
## text1 :
##  [1] "The"   "eater" "ate"   "when"  "I"     "eat"   "and"   "she"   "eats" 
## [10] "."
# after custom "stemmer"
tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
## Tokens consisting of 1 document.
## text1 :
##  [1] "The"  "eat"  "eat"  "when" "I"    "eat"  "and"  "she"  "eat"  "."

# as a dfm
(dfmat <- dfm(toks))
## Document-feature matrix of: 1 document, 10 features (0.00% sparse) and 0 docvars.
##        features
## docs    the eater ate when i eat and she eats .
##   text1   1     1   1    1 1   1   1   1    1 1
dfm_lookup(dfmat, dict, exclusive = FALSE, capkeys = FALSE)
## Document-feature matrix of: 1 document, 7 features (0.00% sparse) and 0 docvars.
##        features
## docs    the eat when i and she .
##   text1   1   4    1 1   1   1 1
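
If you would rather list every word form explicitly than rely on glob patterns such as "eat*", a one-to-one replacement table is another option. The following is only a minimal sketch, assuming quanteda's tokens_replace() with fixed matching; stem_map is a hypothetical hand-made lookup vector, not something quanteda provides:

```r
library("quanteda")

toks <- tokens("The eater ate when I eat and she eats.")

# hypothetical hand-made mapping: names are the word forms, values are the custom "stems"
stem_map <- c(eater = "eat", ate = "eat", eats = "eat")

# replace each listed form by its custom stem; tokens not in the map are left unchanged
toks_custom <- tokens_replace(
  toks,
  pattern = names(stem_map),
  replacement = unname(stem_map),
  valuetype = "fixed"
)
toks_custom
```

Because the matching is fixed rather than glob, only the forms you list are rewritten, which avoids a pattern like "eat*" also catching unrelated words. The trade-off is that every form has to be enumerated by hand.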

Upvotes: 1
