Abhishek Sourabh

Reputation: 101

stemCompletion is not working properly

I am trying to use stemCompletion to convert the stemmed words into complete words.

Here is the code I am using:

library(tm)  # stemDocument() also needs the SnowballC package to be installed

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus

# Remove common word endings (e.g., "ing", "es")
myCorpus.stemmed <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.unstemmed <- tm_map(myCorpus.stemmed, stemCompletion, dictionary=myCorpusCopy)

If I check the first element of the stemmed corpus, it is displayed correctly:

myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"

But if I check the first element of the unstemmed corpus, I get NA instead of the text:

myCorpus.unstemmed[[1]][1]
$content
[1] NA

Why is the unstemmed corpus not showing the right content?

Upvotes: 0

Views: 689

Answers (3)

Abhishek Sourabh

Reputation: 101

Thanks to the answer given by lukeA, I looked for a function that converts the example text into a character vector of individual words.

I came across an answer to another question that provides a custom function which splits the text into individual words before applying stemCompletion:

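stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the text into single stems, complete each one, then paste them back together
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}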

I combined the function with lapply to get a list containing the unstemmed version of each document. This returns the right values, but the result is a plain list rather than a SimpleCorpus, so I had to do some extra work to convert the output list back into a SimpleCorpus.

myCorpus.unstemmed <- lapply(myCorpus.stemmed, stemCompletion_mod, dict = myCorpusCopy)

> myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"

> myCorpus.unstemmed[[1]][1]
$content
[1] "once we have a corpus we typically want to the documents in it"

I don't know why stemCompletion didn't complete "modifi", but that is a separate question to explore.
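One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: as far as I can tell, stemCompletion looks for dictionary words that begin with the stem. The stemmer turns "modify" into "modifi", which is not a prefix of "modify", so no candidate would be found. The snippet below uses base R's startsWith as a stand-in for that lookup; the dictionary and stems vectors are illustrative, not taken from tm.

```r
# Illustrative data: a few dictionary words and the stems produced for them
dictionary <- c("once", "typically", "modify", "documents")
stems <- c("onc", "typic", "modifi", "document")

# Base R stand-in for a prefix lookup: keep dictionary words that begin with the stem
candidates <- lapply(stems, function(s) dictionary[startsWith(dictionary, s)])
names(candidates) <- stems

# TRUE where at least one dictionary word starts with the stem
has_match <- lengths(candidates) > 0
print(has_match)
```

If this assumption holds, "modifi" is the only stem here without a candidate, which would match the missing word in the output above.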

Upvotes: 0

Cameron Kay

Reputation: 73

I'm only slightly familiar with tm, but doesn't stemCompletion require that the tokens be stems rather than already completed words?

Upvotes: 0

lukeA

Reputation: 54247

Why is the unstemmed corpus not showing the right content?

Because you have a SimpleCorpus object, tm_map passes the whole character vector of documents to the function in one go, so you are effectively calling

stemCompletion(
  x = c("once we have a corpus we typically want to modify the documents in it", 
        "eg stemming stopword removal et cetera", 
        "in tm all this functionality is subsumed into the concept of a transformation"),
  dictionary=myCorpusCopy
)

which yields

# once we have a corpus we typically want to modify the documents in it 
# NA 
# eg stemming stopword removal et cetera 
# NA 
# in tm all this functionality is subsumed into the concept of a transformation 
# NA 

because stemCompletion expects a character vector of single stems as its first argument (c("once", "we", "have")), not a character vector of whole stemmed texts (c("once we have")).

If you want to complete the stems in your corpus, whatever this is supposed to be good for, you have to pass a character vector of single stems to stemCompletion (i.e. tokenize each text document, stem-complete the stems, then paste them together again).
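The tokenize, complete and paste steps could be sketched as follows. This is a minimal sketch, not the definitive way to do it: complete_text is a hypothetical helper (not part of tm), the dictionary is a hand-written character vector for illustration, and keeping the original stem when no completion is found is my own assumption about the desired behaviour.

```r
library(tm)

# Hypothetical helper (not part of tm): complete every stem in each text
complete_text <- function(texts, dictionary) {
  vapply(texts, function(txt) {
    # Tokenize on single spaces and drop empty tokens
    stems <- unlist(strsplit(txt, " ", fixed = TRUE))
    stems <- stems[nzchar(stems)]
    completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
    # Assumption: fall back to the stem itself when no completion is found
    missing <- is.na(completed) | !nzchar(completed)
    completed[missing] <- stems[missing]
    paste(completed, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Illustrative dictionary; a corpus such as myCorpusCopy could be passed instead
dictionary <- c("once", "we", "have", "a", "corpus", "typically", "want",
                "to", "modify", "the", "documents", "in", "it")
stemmed <- "onc we have a corpus we typic want to modifi the document in it"
result <- complete_text(stemmed, dictionary)
print(result)

# Untested assumption: on the corpus from the question this might be applied as
# tm_map(myCorpus.stemmed, content_transformer(complete_text), dictionary = myCorpusCopy)
```

Because the helper takes and returns a character vector, it should fit the way tm_map hands the documents of a SimpleCorpus to a transformation, which would avoid converting a list back into a corpus afterwards.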

Upvotes: 1
