Reputation: 22029

How to exclude words from stemming, or modify stemming results, using stemDocument in the tm package in R?

I am working with the Reuters data set from kaggle.com. When I use stemDocument from the tm package, I see some undesired results:

Original word(s)	TM's stemDocument	Desired
inflation	infl	inflat / inflate
united states	unit states	united states
many	mani	many
anniversary	anniversari	anniversary

How do I exclude or modify stemming results?

Upvotes: 0

Answers (1)

App Work

Reputation: 22029

Build the corpus from the training directory:

library(tm)

trade.train.directory <- "F:\\Reuters_Dataset\\training\\trade"

trade.trnCorpus <- VCorpus(DirSource(directory = trade.train.directory, encoding = "ASCII"))

First, list the original words whose stemming you want to control, and then the stem you want for each one, in the same order:

```{r}
# Words to protect from the default stemmer.
unStemed.words <- c("anniversary", "united", "february", "many", "inflation", "initially")

# Desired result for each word above (same position).
stemed.words <- c("anniversary", "united", "february", "many", "inflate", "initial")
```

Now create a transformer that replaces each original word with its desired stem, wrapped in EXCLUSION_MARK and a trailing underscore, so that stemDocument leaves it unchanged:

EXCLUSION_MARK <- "_EXCLUUUUUU"

# Replace each listed word with EXCLUSION_MARK + desired stem + "_".
markStemExclusion <- content_transformer(function(x) {
  for (i in seq_along(unStemed.words)) {
    x <- gsub(paste0("\\b", unStemed.words[i], "\\b"),
              # paste0() has no sep argument, so the "_" is appended explicitly.
              paste0(EXCLUSION_MARK, stemed.words[i], "_"),
              x)
  }
  x
})
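To see what the marking step produces, here is a minimal sketch that runs the same gsub loop on a plain character vector, without tm. The name `mark_words` is hypothetical and only used for this illustration. It also shows why `sep = "_"` in a `paste0()` call ends up as a trailing underscore: `paste0()` has no `sep` argument, so the value is simply concatenated as one more string.

```r
# Same word lists and mark as in the answer.
unStemed.words <- c("anniversary", "united", "february", "many", "inflation", "initially")
stemed.words   <- c("anniversary", "united", "february", "many", "inflate", "initial")
EXCLUSION_MARK <- "_EXCLUUUUUU"

# Hypothetical stand-alone version of the marking loop (plain strings, no tm).
mark_words <- function(x) {
  for (i in seq_along(unStemed.words)) {
    x <- gsub(paste0("\\b", unStemed.words[i], "\\b"),
              paste0(EXCLUSION_MARK, stemed.words[i], "_"),
              x)
  }
  x
}

marked <- mark_words("inflation hit the united states")
print(marked)
# "_EXCLUUUUUUinflate_ hit the _EXCLUUUUUUunited_ states"

# paste0() treats sep = "_" as just another piece to concatenate.
with_sep <- paste0("a", "b", sep = "_")
print(with_sep)
# "ab_"
```

Words already marked are not matched again, because the characters around them (letters and underscores) are word characters, so `\\b` finds no boundary inside a marked token.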

After marking, apply stemDocument, and afterwards remove the mark from the protected words:

# Strip the mark. The trailing "_" is left in place here.
unMarkStemExclusion <- content_transformer(function(x) {
  gsub(EXCLUSION_MARK, " ", x, fixed = TRUE)
})
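As a rough illustration of the unmarking step, the sketch below works on a plain string that stands in for a document after stemming. The input string and the final cleanup are assumptions made for the example. Unmarking only strips the prefix, so the trailing underscore survives until a later punctuation-removal step.

```r
EXCLUSION_MARK <- "_EXCLUUUUUU"

# Assumed text after stemming: marked tokens intact, "states" stemmed to "state".
stemmed <- "_EXCLUUUUUUinflate_ hit the _EXCLUUUUUUunited_ state"

# Same replacement as unMarkStemExclusion, applied to a plain string.
unmarked <- gsub(EXCLUSION_MARK, " ", stemmed, fixed = TRUE)
print(unmarked)
# " inflate_ hit the  united_ state"

# Illustrative cleanup: drop the leftover "_" and collapse extra spaces.
cleaned <- gsub("\\s+", " ", trimws(gsub("_", " ", unmarked, fixed = TRUE)))
print(cleaned)
# "inflate hit the united state"

# Optional check, only if SnowballC is installed: print how the stemmer
# treats a marked token next to an ordinary word.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(SnowballC::wordStem(c("_EXCLUUUUUUinflate_", "states"), language = "english"))
}
```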

Now call the data-cleaning steps one by one, marking the words before stemDocument and unmarking them afterwards. With this order you get the desired result:

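cleanData <- function(corpus, excludeStopWords = FALSE) {
  # replaceAbbreviations, cleanHtmlTags and replacePunctBySpace are custom
  # transformers that are not shown in this answer.
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, replaceAbbreviations)

  # Call markStemExclusion (after lowercasing, so the word lists match).
  corpus <- tm_map(corpus, markStemExclusion)

  # As written, stop words are removed when excludeStopWords is FALSE.
  if (!excludeStopWords) {
    corpus <- tm_map(corpus, removeWords,
                     c("said", "will", "next", stopwords("english")))
  }
  corpus <- tm_map(corpus, cleanHtmlTags)

  # Call stemDocument.
  corpus <- tm_map(corpus, stemDocument)

  # Call unMarkStemExclusion.
  corpus <- tm_map(corpus, unMarkStemExclusion)

  # The trailing "_" left by the marking step is removed here.
  corpus <- tm_map(corpus, replacePunctBySpace)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}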
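The function above relies on three helpers (replaceAbbreviations, cleanHtmlTags, replacePunctBySpace) that are not defined anywhere in this answer. The sketch below gives hypothetical stand-ins, guessed only from their names, so that the pipeline can be tried out. The `_chr` functions, the abbreviation map, and the sample text are all assumptions, not the original implementations.

```r
# Hypothetical stand-ins; the real helpers may behave differently.

# Expand a few abbreviations (illustrative map only).
replace_abbreviations_chr <- function(x) {
  abbreviations <- c("u.s." = "united states")
  for (abbr in names(abbreviations)) {
    x <- gsub(abbr, abbreviations[[abbr]], x, fixed = TRUE)
  }
  x
}

# Drop anything that looks like an HTML/SGML tag, e.g. <body> or </p>.
clean_html_tags_chr <- function(x) gsub("<[^>]+>", " ", x)

# Turn every run of punctuation (including "_") into a single space.
replace_punct_by_space_chr <- function(x) gsub("[[:punct:]]+", " ", x)

# Wrap for tm_map() under the names cleanData() expects, if tm is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  replaceAbbreviations <- tm::content_transformer(replace_abbreviations_chr)
  cleanHtmlTags        <- tm::content_transformer(clean_html_tags_chr)
  replacePunctBySpace  <- tm::content_transformer(replace_punct_by_space_chr)
}

# Try the plain-string versions on a small sample.
sample_text <- "<p>u.s. trade, inflate_</p>"
result <- replace_punct_by_space_chr(
  clean_html_tags_chr(replace_abbreviations_chr(sample_text)))
result <- gsub("\\s+", " ", trimws(result))
print(result)
# "united states trade inflate"
```

With these definitions and the transformers from the earlier steps in scope, `cleanData(trade.trnCorpus)` should run end to end, and `content(cleaned[[1]])` on the returned corpus can be used to inspect a single document.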