Reputation: 11
I am trying to get a list of all the stemmed words along with their original forms.
Here is an example:
library(tm)
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
corpus<-Corpus(VectorSource(text))
corpus<-tm_map(corpus,stemDocument)
I am looking for a result like this, as a data frame:
original_word stemmed
Impressed Impress
shipping ship
very veri
helpful help
wonderful wonder
experience experi
Upvotes: 0
Views: 1273
Reputation: 1482
This is a little more efficient than @jazzurro's answer:
library("corpus")
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
word <- text_types(text, collapse = TRUE, drop = stopwords_en, drop_punct = TRUE)
stem <- SnowballC::wordStem(word, "english")
data.frame(word, stem)
Result:
word stem
1 arrived arriv
2 days day
3 earlier earlier
4 expected expect
5 experience experi
6 helpful help
7 impressed impress
8 shipping ship
9 time time
10 wonderful wonder
(The text_types function also accepts tm Corpus objects, if that matters to you.)
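Note that the desired output in the question includes "very", which is dropped here as a stop word. If you would rather keep stop words, a minimal base R sketch could look like the following. It assumes that lowercasing and splitting on anything that is not a letter is good enough tokenization for your data; the names tokens and result are only illustrative.

```r
library(SnowballC)

text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected",
          "it was very helpful", "It was a wonderful experience")

# Lowercase, split on runs of non-letters, and keep the unique non-empty tokens.
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
word <- sort(unique(tokens[nzchar(tokens)]))

# Stem with the same "english" stemmer used above; no stop words are removed.
result <- data.frame(word, stem = wordStem(word, "english"), stringsAsFactors = FALSE)

# Keep only the rows where stemming actually changed the word.
result[result$word != result$stem, ]
```

Dropping the last line gives the full word/stem table instead of only the changed words.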
Upvotes: 2
Reputation: 23574
This may be helpful for you. The SnowballC package has a function called wordStem(). Using it, you could do the following. Since I use unnest_tokens() from the tidytext package, I created a data frame first. The function splits the text into words and creates a long-format data set. It seems that you want to remove stop words, so I did that using filter(). The final step is the crucial one for you: I used wordStem() from the SnowballC package to extract stems for the words remaining in the data. The outcome may not be exactly what you want, but I hope this helps to some extent.
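library(dplyr)
library(tidytext)
library(SnowballC)
# `text` is the character vector defined in the question.
# tibble() replaces the deprecated data_frame().
mydf <- tibble(id = seq_along(text),
               text = text)
data(stop_words)
mydf %>%
  unnest_tokens(input = text, output = word) %>%  # one lowercase word per row
  filter(!word %in% stop_words$word) %>%          # drop stop words
  mutate(stem = wordStem(word))                   # stem each remaining word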
# id word stem
# <int> <chr> <chr>
# 1 1 impressed impress
# 2 1 shipping ship
# 3 1 time time
# 4 1 arrived arriv
# 5 1 days dai
# 6 1 earlier earlier
# 7 1 expected expect
# 8 2 helpful help
# 9 3 wonderful wonder
#10 3 experience experi
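One thing to be aware of: "days" becomes "dai" here but "day" in the other answer. As far as I can tell, this is because wordStem() defaults to language = "porter", while the other answer passes "english". A small sketch to compare the two stemmers side by side (the words vector and the comparison name are just for illustration):

```r
library(SnowballC)

# A few words from the example text.
words <- c("days", "shipping", "experience")

# Stem the same words with the default Porter stemmer and the "english" one.
comparison <- data.frame(
  word    = words,
  porter  = wordStem(words, language = "porter"),
  english = wordStem(words, language = "english"),
  stringsAsFactors = FALSE
)
comparison
```

If you prefer "day" over "dai", passing language = "english" to wordStem() inside mutate() should be enough.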
Upvotes: 0