Reputation: 542
I'll show by example:
library(data.table)
dt <- data.table(words = c("finance", "financial", "business"),
                 freq = c(123, 5, 4589))
# := updates dt by reference, so no reassignment is needed
dt[, words := SnowballC::wordStem(words, language = "english")]
View(dt)
words freq
financ 123
financi 5
busi 4589
I thought word stemming would give me finance, finance and business. At the very least, I would expect finance and financial to share the same base word. I'm trying to group similar words. It works for some words (have and having both become have), but for others, like the ones above, it doesn't seem to work. Or am I misunderstanding something?
Upvotes: 2
Views: 200
Reputation: 3121
It seems like your result is what the Porter stemmer algorithm is supposed to do.
The documentation (Step 4) shows examples of stemming with the suffixes used in your example:
(m>1) AL   -> (removed)    e.g. revival   -> reviv
(m>1) ANCE -> (removed)    e.g. allowance -> allow
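To see why finance and financial end up with different stems, it may help to run the documentation's examples next to the words from the question. The condition (m>1) means a suffix is only removed when the part left in front of it is long enough. In finance, only "fin" precedes "ance", which is too short, so just the final "e" is dropped later. In financial, "financi" precedes "al", so "al" is removed. A small sketch, assuming SnowballC is installed and using the same language = "english" setting as the question:

```r
library(SnowballC)

# Two words from the stemmer documentation, two from the question
words <- c("revival", "allowance", "finance", "financial")

# wordStem() is vectorised: it returns one stem per input word
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```

The stemmer only strips suffixes by rule. It never looks words up in a dictionary, so it has no way of knowing that "financ" and "financi" belong together.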
If you want to group your words, you might want to trim them before running wordStem, or use string matching functions after stemming (e.g. agrep).
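The second suggestion could look roughly like the sketch below. It uses base R's adist (edit distance, a close relative of agrep) instead of agrep itself, because a full distance matrix is easier to turn into groups. The max_dist threshold and the "join the first stem within range" rule are my own assumptions for illustration, not something SnowballC provides:

```r
# Stems as produced in the question, with their frequencies
stems <- c("financ", "financi", "busi")
freq  <- c(123, 5, 4589)

# Pairwise generalized Levenshtein (edit) distances between the stems
d <- adist(stems)

# Hypothetical rule: each stem joins the first stem within max_dist edits of it
max_dist <- 1
group <- apply(d <= max_dist, 1, function(row) which(row)[1])

# Label each group by its first stem and sum the frequencies per group
grouped <- tapply(freq, stems[group], sum)
print(grouped)
```

Here financ and financi are one edit apart, so they are merged and their frequencies add up to 128, while busi stays on its own. A threshold of 1 will also merge unrelated short stems that differ by a single letter, so it needs tuning on real data.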
Upvotes: 1