Reputation: 21
Does anyone know how to treat a group of features of a dtm as a SINGLE feature? The problem is that the standard stemming functions such as tokens_wordstem or dfm_wordstem don't do a good job in my case, so I want to define customized features by hand, for instance “eat” for “eat”, “eater”, “ate”.
Upvotes: 2
Views: 276
Reputation: 14902
You could do it with a dictionary, provided you can write your own rules for which forms map to which feature:
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
toks <- tokens("The eater ate when I eat and she eats.")
dict <- dictionary(list(eat = c("eat*", "ate")))
# original tokens
toks
## Tokens consisting of 1 document.
## text1 :
## [1] "The" "eater" "ate" "when" "I" "eat" "and" "she" "eats"
## [10] "."
# after custom "stemmer"
tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "The" "eat" "eat" "when" "I" "eat" "and" "she" "eat" "."
# as a dfm
(dfmat <- dfm(toks))
## Document-feature matrix of: 1 document, 10 features (0.00% sparse) and 0 docvars.
## features
## docs the eater ate when i eat and she eats .
## text1 1 1 1 1 1 1 1 1 1 1
dfm_lookup(dfmat, dict, exclusive = FALSE, capkeys = FALSE)
## Document-feature matrix of: 1 document, 7 features (0.00% sparse) and 0 docvars.
## features
## docs the eat when i and she .
## text1 1 4 1 1 1 1 1
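One caveat with the glob pattern "eat*": it matches every token that starts with "eat", so a word like "eatery" would also be folded into "eat". If you would rather list each form by hand, a minimal sketch using tokens_replace() with exact matching could look like the following. The vectors `forms` and `lemmas` are illustrative names I made up, and "eating" is just an extra form added for the example.

```r
library("quanteda")

# Hypothetical hand-built mapping: every surface form listed here
# is rewritten to the single feature "eat"
forms  <- c("eater", "ate", "eats", "eating")
lemmas <- rep("eat", length(forms))

toks <- tokens("The eater ate when I eat and she eats.")

# valuetype = "fixed" matches whole tokens exactly,
# so "ate" does not touch tokens such as "late" or "plate"
toks_custom <- tokens_replace(toks, pattern = forms, replacement = lemmas,
                              valuetype = "fixed")

# build the dfm from the already-normalized tokens
dfmat_custom <- dfm(toks_custom)
```

This should give the same result as the dictionary version for this sentence, but nothing is matched unless you listed it explicitly. The trade-off is that you have to enumerate every form yourself.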
Upvotes: 1