R wordstem chopping words too much

Question

I'll show by example:

library(data.table)
dt <- data.table(words = c("finance", "financial", "business"),
                  freq = c(123, 5, 4589))
dt <- dt[, words := SnowballC::wordStem(words, language = "english")]
View(dt)

words    freq
financ    123
financi     5
busi     4589

I thought word stemming would give me finance, finance and business. At the very least I would expect finance and financial to share the same stem. I'm trying to group similar words. It works for some words (have and having both become have), but for others, like the ones above, it doesn't seem to work. Am I misunderstanding something?

epo3 · Accepted Answer

It seems like your result is what the Porter stemmer algorithm is supposed to produce.

The documentation (Step 4) shows examples of stemming with the suffixes used in your example. In both rules the suffix is simply removed:

(m>1) AL -> (nothing), e.g. revival -> reviv

(m>1) ANCE -> (nothing), e.g. allowance -> allow

If you want to group your words, you might trim them before running wordStem, or use approximate string matching functions after stemming (e.g. agrep).
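As a rough sketch of the second option, the snippet below groups the stems from the question with base R's agrep. The stems are hard-coded from the output above, so SnowballC is not needed to run it. The one-edit tolerance (max.distance = 1) and the rule "use the first matching stem as the group label" are assumptions chosen for this small example, not something wordStem or agrep prescribes; a looser tolerance can merge unrelated words.

```r
# Stems and frequencies copied from the question's output
stems <- c("financ", "financi", "busi")
freq  <- c(123, 5, 4589)

# For each stem, take the first stem in the vector that approximately
# matches it; max.distance = 1 allows a single insertion, deletion or
# substitution (an assumed tolerance for this example)
group <- vapply(
  stems,
  function(s) agrep(s, stems, max.distance = 1, value = TRUE)[1],
  character(1),
  USE.NAMES = FALSE
)

# Sum the frequencies within each group
totals <- aggregate(freq ~ group, data = data.frame(group, freq), FUN = sum)
print(totals)
```

Here financ and financi end up under the same label, while busi stays on its own. On a real vocabulary it is worth inspecting the groups by hand, since short stems match many strings within one edit.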
