Oli

Reputation: 542

R wordStem chopping words too much

I'll show by example:

library(data.table)
dt <- data.table(words = c("finance", "financial", "business"),
                 freq = c(123, 5, 4589))
# := modifies dt by reference, so no reassignment is needed
dt[, words := SnowballC::wordStem(words, language = "english")]
View(dt)

words     freq
financ     123
financi      5
busi      4589

I thought word stemming would give me finance, finance and business. At the very least, I would expect finance and financial to have the same base word. I'm trying to group similar words. It works for some words (have and having both become have), but for others, like the ones above, it doesn't seem to work. Or am I misunderstanding something?

Upvotes: 2

Views: 200

Answers (1)

epo3

Reputation: 3121

It seems like your result is what the Porter stemmer algorithm is supposed to do.

The documentation (Step 4) shows examples of stemming with the suffixes used in your example:

(m>1) AL -> revival -> reviv

(m>1) ANCE -> allowance -> allow

If you want to group your words, you might want to trim them before running wordStem, or use approximate string matching functions on the stems afterwards (e.g. agrep).
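
As a rough sketch of the agrep idea, something like the following could group the stems from your output. The stems and frequencies are copied from the question; the variable names (stems, group, totals) and the max.distance = 1 threshold are my own assumptions for illustration, so tune them to your data.

```r
# Stems and frequencies copied from the wordStem output in the question
stems <- c("financ", "financi", "busi")
freq  <- c(123, 5, 4589)

# For each stem, use the first stem in the vector that approximately
# matches it as the group label. max.distance = 1 (at most one edit)
# is an assumed threshold, not a recommended value.
group <- vapply(
  stems,
  function(s) agrep(s, stems, max.distance = 1, value = TRUE)[1],
  character(1),
  USE.NAMES = FALSE
)

# Sum the frequencies within each group
totals <- tapply(freq, group, sum)
print(totals)
```

Keep in mind that agrep matches the pattern against substrings, so a loose threshold can merge short, unrelated stems. It is probably worth checking the resulting groups by eye before aggregating.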

Upvotes: 1
