Rollo99

Reputation: 1613

Is there a better way to ignore the plural than "stem = TRUE" in a dfm?

I am pre-processing my data to run an LDA model. Is there a better way to ignore plurals (so that pairs such as "rates" and "rate", or "country" and "countries", count as one feature) than using "stem = TRUE"? I don't want to stem all the words, only some specific words that appear frequently in both singular and plural form.

Any hint?

I tried "stem = TRUE", and I also created a dictionary and used "dictionary = dict" in the dfm call, but that obviously grabs only the words in the dictionary.

Upvotes: 0

Views: 870

Answers (1)

Ken Benoit

Reputation: 14902

The best way to do this is to use a part-of-speech tagger to identify your plural nouns, and then to convert these to their singular forms. Unlike the stemmer solution, this will not reduce words such as "stemming" to "stem", or "quickly" to "quick", etc.

I recommend using the spacyr package for this, which integrates nicely with quanteda. Here's an example:

library("quanteda")
## Package version: 1.4.3

library("spacyr")

txt <- c(
  "Plurals in English can include irregular words such as stimuli.",
  "One mouse, two mice, one house, two houses."
)
txt_parsed <- spacy_parse(txt, tag = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.1.3, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
txt_parsed
##    doc_id sentence_id token_id     token     lemma   pos tag     entity
## 1   text1           1        1   Plurals    plural  NOUN NNS           
## 2   text1           1        2        in        in   ADP  IN           
## 3   text1           1        3   English   English PROPN NNP LANGUAGE_B
## 4   text1           1        4       can       can  VERB  MD           
## 5   text1           1        5   include   include  VERB  VB           
## 6   text1           1        6 irregular irregular   ADJ  JJ           
## 7   text1           1        7     words      word  NOUN NNS           
## 8   text1           1        8      such      such   ADJ  JJ           
## 9   text1           1        9        as        as   ADP  IN           
## 10  text1           1       10   stimuli  stimulus  NOUN NNS           
## 11  text1           1       11         .         . PUNCT   .           
## 12  text2           1        1       One       one   NUM  CD CARDINAL_B
## 13  text2           1        2     mouse     mouse  NOUN  NN           
## 14  text2           1        3         ,         , PUNCT   ,           
## 15  text2           1        4       two       two   NUM  CD CARDINAL_B
## 16  text2           1        5      mice     mouse  NOUN NNS           
## 17  text2           1        6         ,         , PUNCT   ,           
## 18  text2           1        7       one       one   NUM  CD CARDINAL_B
## 19  text2           1        8     house     house  NOUN  NN           
## 20  text2           1        9         ,         , PUNCT   ,           
## 21  text2           1       10       two       two   NUM  CD CARDINAL_B
## 22  text2           1       11    houses     house  NOUN NNS           
## 23  text2           1       12         .         . PUNCT   .

# replace token with lemma for plural nouns
txt_parsed$token <- ifelse(txt_parsed$tag == "NNS",
  txt_parsed$lemma,
  txt_parsed$token
)

(Of course there are many ways to execute this conditional replacement, including dplyr.)
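
For illustration, here is a minimal sketch of the same replacement written with dplyr. To keep it runnable without spaCy, it uses a small hand-built data frame (called `parsed` here, a hypothetical stand-in) that copies a few rows of the `spacy_parse()` output shown above; in practice you would apply the `mutate()` step to `txt_parsed` directly.

```r
library("dplyr")

# Hand-built stand-in for a few rows of the spacy_parse() output above
# (an assumption for illustration; use txt_parsed in real code)
parsed <- data.frame(
  doc_id = c("text2", "text2", "text2", "text2"),
  token  = c("mouse", "mice", "house", "houses"),
  lemma  = c("mouse", "mouse", "house", "house"),
  tag    = c("NN", "NNS", "NN", "NNS"),
  stringsAsFactors = FALSE
)

# Replace the token with its lemma only where the tag marks a plural noun
parsed <- parsed %>%
  mutate(token = if_else(tag == "NNS", lemma, token))

parsed$token
```

`if_else()` is stricter about types than base `ifelse()`, which is harmless here because `token` and `lemma` are both character columns.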

Now the plural nouns have been replaced by their singular forms, including irregular ones such as stimuli and mice, which no stemmer would be smart enough to figure out.

dfmat <- dfm(as.tokens(txt_parsed), remove_punct = TRUE)
dfmat
## Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##        features
## docs    plural in english can include irregular word such as stimulus one
##   text1      1  1       1   1       1         1    1    1  1        1   0
##   text2      0  0       0   0       0         0    0    0  0        0   2
##        features
## docs    mouse two house
##   text1     0   0     0
##   text2     2   2     2
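
If you only want to collapse some specific words, as in the question, one possible variation is to restrict the replacement to a list of lemmas you choose. The sketch below is only an illustration under assumptions: `target_lemmas` and the hand-built `parsed` data frame are hypothetical, standing in for your own word list and for the output of `spacy_parse()`.

```r
# Hypothetical list of lemmas whose plurals should be collapsed
target_lemmas <- c("rate", "country")

# Hand-built stand-in for spacy_parse() output (assumed values, for illustration)
parsed <- data.frame(
  token = c("rates", "rate", "countries", "houses"),
  lemma = c("rate", "rate", "country", "house"),
  tag   = c("NNS", "NN", "NNS", "NNS"),
  stringsAsFactors = FALSE
)

# Replace only plural nouns whose lemma is in the target list
is_target <- parsed$tag == "NNS" & parsed$lemma %in% target_lemmas
parsed$token[is_target] <- parsed$lemma[is_target]

parsed$token
```

Here "houses" is left untouched because "house" is not in the list, while "rates" and "countries" are reduced to their singular forms.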

Upvotes: 3
