Reputation: 39
I'm learning text mining in R and have had pretty good success, but I am stuck on how to deal with plurals. For example, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word too.
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'
Upvotes: 2
Views: 4743
Reputation: 684
The SemNetCleaner package has a singularize function. It's slower than the one in the pluralize package, but I find it handles nouns better. For example, "Mars" is not converted into "Mar".
Upvotes: 1
Reputation: 110004
Here is one possible solution. I use the pacman package to make it self-contained:
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')
p_load(quanteda)
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Note: tokenize() comes from older quanteda releases; newer versions provide tokens() instead.
singularize(unlist(tokenize(x)))
## [1] "\"" "nation" "\"" "and" "\"" "nation" "\""
## [8] "to" "be" "counted" "a" "the" "same" "word"
## [15] "and" "ideally" "\"" "dictionary" "\"" "and" "\""
## [22] "dictionary" "\""
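To go from singularized tokens to the counts the question asks for, something like the following base R sketch may help. Note that `naive_singularize` is a hypothetical, deliberately crude stand-in for a real singularizer such as the `singularize` functions mentioned in this thread. It only knows two suffix rules, and it is used here just so the example runs without extra packages.

```r
# Hypothetical helper: a crude rule-based singularizer (NOT the pluralize or
# SemNetCleaner implementation). It will get many irregular nouns wrong.
naive_singularize <- function(words) {
  words <- sub("ies$", "y", words)             # dictionaries -> dictionary
  words <- sub("([^aeious])s$", "\\1", words)  # nations -> nation; keeps "as", "is"
  words
}

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Split on anything that is not a letter, lower-case, and drop empty strings
words <- tolower(unlist(strsplit(x, "[^[:alpha:]]+")))
words <- words[nzchar(words)]

# Count word frequencies after mapping plurals to their singular form
counts <- table(naive_singularize(words))
sort(counts, decreasing = TRUE)
```

With this input, "nation" and "dictionary" are each counted twice. In real use you would swap `naive_singularize` for one of the package functions, since the two regex rules above are only an illustration.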
Upvotes: 8