Reputation: 1351
I have created the corpus, defined the stopwords, and done the cleansing (removePunctuation, removeNumbers, tolower...).
The corpus is now ready to be stemmed. The stemming function runs correctly and everything works as it should, but...
I need to know which words are being stemmed to each common root. Is that possible using the tm package, or any other package?
For example, TermA1, TermA2, TermB1, TermB2 and TermB3 are all stemmed to Term, and my new corpus contains only Term. However, I also need to know which words are associated with each root word, so the ideal output would be:
Term Stem
TermA1 Term
TermA2 Term
TermB1 Term
TermB2 Term
TermB3 Term
...
WordA1 Word
WordB1 Word
WordB2 Word
WordB3 Word
WordC1 Word
Upvotes: 0
Views: 310
Reputation: 412
In the tm package there is the function stemCompletion that allows you to complete each stemmed word given a specific dictionary.
To obtain your output do as follows:
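library(tm)
library(data.table)  # needed for data.table() below
data("crude")
# Complete each stem using the crude corpus as the dictionary
words <- stemCompletion(c("compan", "entit", "suppl"), crude)
stemmed <- names(words)    # the stems passed in
stemcomp <- unname(words)  # the completion chosen for each stem
data.table(stemmed, stemcomp)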
References: stemCompletion {tm}
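Note that stemCompletion returns a single completion per stem. If you need every original word next to its stem, as in the table from the question, one option is to stem the vocabulary yourself and keep both columns. The following is only a minimal sketch: it assumes the SnowballC package is used for stemming, and the `terms` vector is a hypothetical stand-in for your own vocabulary (which could come from something like Terms(DocumentTermMatrix(corpus)) on the unstemmed corpus).

```r
library(SnowballC)

# Hypothetical vocabulary; replace with the terms of your own unstemmed corpus
terms <- c("connect", "connected", "connecting", "connection",
           "supply", "supplies")

# Keep each original term next to the stem it is reduced to
stem_map <- data.frame(Term = terms,
                       Stem = wordStem(terms, language = "english"),
                       stringsAsFactors = FALSE)

# List the original terms grouped by their common stem
terms_by_stem <- split(stem_map$Term, stem_map$Stem)
print(stem_map)
print(terms_by_stem)
```

Here `stem_map` has one row per original word, and `terms_by_stem` gives the reverse lookup from each stem to all words that share it.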
[UPDATE: more German words]
I tried this to check the behavior with German umlauts:
library(SnowballC)
library(tm)
library(data.table)
text <- c("für", "aktuelle", "Nachrichten", "und", "Themen", "Bilder",
          "und", "Videos", "aus", "den", "Bereichen", "News", "Wirtschaft",
          "Politik", "können", "Fremdschämen", "Lebensmüde", "Erklärungsnot")
# Note: "porter" selects the English Porter stemmer
stemmed <- wordStem(text, language = "porter")
completed <- stemCompletion(stemmed, text)
comparison <- data.table(text, stemmed, completed)
In the comparison table you can see that the original words containing German umlauts are not stemmed. However, if you try to complete a given stem such as "f" with stemCompletion("f", text), you get the correct word "für".
This is strange; maybe you can continue from here and try to find a workaround.
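One thing that may be worth checking is the `language` argument: the snippet above uses "porter", which is an English stemmer, while SnowballC also lists "german" among its supported languages (see getStemLanguages()). The sketch below simply puts both results side by side for a few of the words above. It is an assumption on my part that this accounts for the behavior, so compare the two columns on your own data before relying on it.

```r
library(SnowballC)

# A few of the words from the example above
text <- c("können", "Nachrichten", "Bereichen", "Lebensmüde")

# Compare the English Porter stemmer with the German Snowball stemmer
stem_cmp <- data.frame(text   = text,
                       porter = wordStem(text, language = "porter"),
                       german = wordStem(text, language = "german"),
                       stringsAsFactors = FALSE)
print(stem_cmp)
```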
Upvotes: 2