Reputation: 1613
I am pre-processing my data to run an LDA model. Is there a better way to collapse plurals (such as "rates" and "rate", or "countries" and "country") than using "stem = TRUE"? I don't want to stem every word, only some specific words that appear frequently in both plural and singular form.
Any hint?
I tried "stem = TRUE", and I also created a dictionary and passed "dictionary = dict" in the dfm() call, but that obviously keeps only the words in the dictionary.
Upvotes: 0
Views: 870
Reputation: 14902
The best way to do this is to use a part-of-speech tagger to identify your plural nouns, and then to convert these to their singular forms. Unlike the stemmer solution, this will not reduce words such as "stemming" to "stem", or "quickly" to "quick".
I recommend using the spacyr package for this, which integrates nicely with quanteda. Here's an example:
library("quanteda")
## Package version: 1.4.3
library("spacyr")
txt <- c(
  "Plurals in English can include irregular words such as stimuli.",
  "One mouse, two mice, one house, two houses."
)
txt_parsed <- spacy_parse(txt, tag = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.1.3, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
txt_parsed
##    doc_id sentence_id token_id     token     lemma   pos tag     entity
## 1   text1           1        1   Plurals    plural  NOUN NNS
## 2   text1           1        2        in        in   ADP  IN
## 3   text1           1        3   English   English PROPN NNP LANGUAGE_B
## 4   text1           1        4       can       can  VERB  MD
## 5   text1           1        5   include   include  VERB  VB
## 6   text1           1        6 irregular irregular   ADJ  JJ
## 7   text1           1        7     words      word  NOUN NNS
## 8   text1           1        8      such      such   ADJ  JJ
## 9   text1           1        9        as        as   ADP  IN
## 10  text1           1       10   stimuli  stimulus  NOUN NNS
## 11  text1           1       11         .         . PUNCT   .
## 12  text2           1        1       One       one   NUM  CD CARDINAL_B
## 13  text2           1        2     mouse     mouse  NOUN  NN
## 14  text2           1        3         ,         , PUNCT   ,
## 15  text2           1        4       two       two   NUM  CD CARDINAL_B
## 16  text2           1        5      mice     mouse  NOUN NNS
## 17  text2           1        6         ,         , PUNCT   ,
## 18  text2           1        7       one       one   NUM  CD CARDINAL_B
## 19  text2           1        8     house     house  NOUN  NN
## 20  text2           1        9         ,         , PUNCT   ,
## 21  text2           1       10       two       two   NUM  CD CARDINAL_B
## 22  text2           1       11    houses     house  NOUN NNS
## 23  text2           1       12         .         . PUNCT   .
# replace token with lemma for plural nouns
txt_parsed$token <- ifelse(txt_parsed$tag == "NNS",
  txt_parsed$lemma,
  txt_parsed$token
)
(Of course there are many ways to execute this conditional replacement, including dplyr.)
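As a rough sketch of the dplyr variant mentioned above: the data frame below is a small hand-built stand-in for the output of spacy_parse() (its rows are made up for illustration), so the snippet runs without a spaCy installation. On real data you would apply the same mutate() call to txt_parsed.

```r
library(dplyr)

# Hand-built stand-in for a few rows of spacy_parse() output (illustrative only)
parsed <- data.frame(
  token = c("two", "mice", "quickly", "houses"),
  lemma = c("two", "mouse", "quickly", "house"),
  tag   = c("CD", "NNS", "RB", "NNS"),
  stringsAsFactors = FALSE
)

# Replace the token with its lemma only where the tag marks a plural noun
parsed <- parsed %>%
  mutate(token = if_else(tag == "NNS", lemma, token))

parsed$token
```

if_else() is stricter than base ifelse() about both branches having the same type, which is why the columns are kept as character vectors here.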
Now the plural nouns have been replaced by their singular forms, including irregular ones such as "stimuli" and "mice", which a stemmer would not handle correctly.
dfmat <- dfm(as.tokens(txt_parsed), remove_punct = TRUE)
dfmat
## Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##        features
## docs    plural in english can include irregular word such as stimulus one
##   text1      1  1       1   1       1         1    1    1  1        1   0
##   text2      0  0       0   0       0         0    0    0  0        0   2
##        features
## docs    mouse two house
##   text1     0   0     0
##   text2     2   2     2
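Since the question asks about collapsing only some specific words, the same replacement can be restricted to a chosen set of lemmas. This is a minimal sketch in base R, not part of the original answer: target_lemmas is a hypothetical list you would supply yourself, and the data frame is again a hand-built stand-in for spacy_parse() output.

```r
# Hand-built stand-in for spacy_parse() output (illustrative only)
parsed <- data.frame(
  token = c("rates", "countries", "houses", "rate"),
  lemma = c("rate", "country", "house", "rate"),
  tag   = c("NNS", "NNS", "NNS", "NN"),
  stringsAsFactors = FALSE
)

# Hypothetical list of lemmas whose plural forms should be collapsed
target_lemmas <- c("rate", "country")

# Replace only plural nouns whose lemma is in the target list
is_target <- parsed$tag == "NNS" & parsed$lemma %in% target_lemmas
parsed$token <- ifelse(is_target, parsed$lemma, parsed$token)

parsed$token
```

Plural nouns outside the list (here "houses") are left untouched, so the rest of the vocabulary stays as it was.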
Upvotes: 3