Reputation: 101
I am trying to use stemCompletion to convert stemmed words back into complete words.
This is the code I am using:
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# *Removing common word endings* (e.g., "ing", "es")
myCorpus.stemmed <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.unstemmed <- tm_map(myCorpus.stemmed, stemCompletion, dictionary=myCorpusCopy)
If I check the first element of the stemmed corpus, it shows the content correctly:
myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"
But if I check the first element of the unstemmed corpus, it returns NA instead of text:
myCorpus.unstemmed[[1]][1]
$content
[1] NA
Why is the unstemmed corpus not showing the right content?
Upvotes: 0
Views: 689
Reputation: 101
Thanks to the answer given by Luke, I looked for a function that could help convert the example text into a character vector.
I came across an answer to another question that gives a custom function which splits the text into individual words before applying stemCompletion.
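stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the text into single stems, complete each one, then rejoin them
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}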
I combined the function with lapply to get a list containing the unstemmed version. This returns the right values, but the result is not a SimpleCorpus! I had to juggle the output list to convert it back into a SimpleCorpus.
myCorpus.unstemmed <- lapply(myCorpus.stemmed, stemCompletion_mod, dict = myCorpusCopy)
> myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"
> myCorpus.unstemmed[[1]][1]
$content
[1] "once we have a corpus we typically want to the documents in it"
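One possible way to turn such a list back into a corpus is sketched below. It assumes each list element is a PlainTextDocument, as returned by stemCompletion_mod; the names completed_docs, completed_txt and rebuilt are made up for this illustration, and the two short texts are stand-ins for the real lapply output.

```r
library(tm)

# Stand-in for the list returned by lapply(): one PlainTextDocument per text.
completed_docs <- list(
  PlainTextDocument("once we have a corpus"),
  PlainTextDocument("eg stemming stopword removal")
)

# Pull the text out of each document so we get a plain character vector.
completed_txt <- vapply(
  completed_docs,
  function(doc) paste(content(doc), collapse = " "),
  character(1)
)

# Rebuild a corpus from the character vector, as in the original code.
rebuilt <- Corpus(VectorSource(completed_txt))
```

Whether the result is a SimpleCorpus or a VCorpus depends on the installed tm version, so it is worth checking class(rebuilt).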
I don't know why stemCompletion didn't complete modifi. But that will be part of another question to explore.
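A likely explanation, assuming stemCompletion picks its candidates from dictionary words that begin with the stem: "modify" does not start with "modifi", so no candidate is found. The base R sketch below only imitates that prefix lookup; candidates is a hypothetical helper, not a tm function, and the dictionary is a hand-picked subset of the example text.

```r
# Words available as completions (taken from the first example text).
dictionary <- c("once", "typically", "modify", "documents")

# Hypothetical helper: the dictionary words that begin with a given stem.
candidates <- function(stem) dictionary[startsWith(dictionary, stem)]

candidates("typic")   # "typically"
candidates("modifi")  # character(0): "modify" does not start with "modifi"
```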
Upvotes: 0
Reputation: 73
I'm only slightly familiar with tm, but doesn't stemCompletion require that the tokens be stems and not already completed words?
Upvotes: 0
Reputation: 54247
Why is the unstemmed corpus not showing the right content?
Since you have a SimpleCorpus object, you are effectively calling
stemCompletion(
x = c("once we have a corpus we typically want to modify the documents in it",
"eg stemming stopword removal et cetera",
"in tm all this functionality is subsumed into the concept of a transformation"),
dictionary=myCorpusCopy
)
which yields
# once we have a corpus we typically want to modify the documents in it
# NA
# eg stemming stopword removal et cetera
# NA
# in tm all this functionality is subsumed into the concept of a transformation
# NA
because stemCompletion expects a character vector of stems as its first argument (c("once", "we", "have")), not a character vector of stemmed texts (c("once we have")).
If you want to complete the stems in your corpus, whatever this is supposed to be good for, you have to pass a character vector of single stems to stemCompletion (i.e. tokenize each text document, stem-complete the stems, then paste them together again).
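The tokenize, complete and paste steps could look roughly like the sketch below. It works on a plain character string rather than a corpus to keep things short; complete_text is a hypothetical helper, the dictionary is typed in by hand from the first example text, and falling back to the bare stem when nothing matches is an assumption of this sketch rather than tm behaviour.

```r
library(tm)

# Hypothetical helper (not part of tm): complete every stem in one text.
complete_text <- function(text, dictionary) {
  # 1. Tokenize: split the text into single stems.
  stems <- unlist(strsplit(text, " ", fixed = TRUE))
  stems <- stems[nzchar(stems)]
  # 2. Stem-complete each stem against the dictionary.
  completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
  # Keep the stem itself when no completion was found (NA or empty string).
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  # 3. Paste the completed words together again.
  paste(completed, collapse = " ")
}

# Dictionary of complete words, here the words of the first example text.
dictionary <- c("once", "we", "have", "a", "corpus", "typically", "want",
                "to", "modify", "the", "documents", "in", "it")

stemmed <- "onc we have a corpus we typic want to modifi the document in it"
result <- complete_text(stemmed, dictionary)
print(result)
```

To apply this to a whole corpus, the same helper could be wrapped in content_transformer and passed to tm_map, with vapply over the texts inside the wrapper so that each text is handled separately.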
Upvotes: 1