Reputation: 22029
I am working with the Reuters data set from kaggle.com. When I use stemDocument from the tm package, I get some undesired results:
Original word(s) | TM's stemDocument | Desired |
---|---|---|
inflation | infl | inflat / inflate |
united states | unit states | united states |
many | mani | many |
anniversary | anniversari | anniversary |
How do I exclude or modify stemming results?
Upvotes: 0
Views: 82
Reputation: 22029
```{r}
library(tm)  # provides VCorpus, DirSource, tm_map, stemDocument
trade.train.directory <- "F:\\Reuters_Dataset\\training\\trade"
trade.trnCorpus <- VCorpus(DirSource(directory = trade.train.directory, encoding = "ASCII"))
```
First, list the original words whose stemming you want to override:
```{r}
unStemed.words <- c("anniversary", "united", "february", "many", "inflation", "initially")
#stemed.map<-setNames(as.list(unStemed.words),stemed.words)
```
Then list the stem you want for each of them, in the same order:
```{r}
stemed.words <- c("anniversary", "united", "february", "many", "inflate", "initial")
```
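Because the two vectors are paired by position, a quick sanity check can catch a missing or misplaced entry. This is only a minimal sketch in base R: it repeats the two vectors so it runs on its own, and the name `stem.map` is an illustrative choice, not something the rest of the answer relies on.

```r
unStemed.words <- c("anniversary", "united", "february", "many", "inflation", "initially")
stemed.words   <- c("anniversary", "united", "february", "many", "inflate", "initial")

# The vectors are matched by position, so they must have the same length
stopifnot(length(unStemed.words) == length(stemed.words))

# Named vector: names are the original words, values are the desired stems
stem.map <- setNames(stemed.words, unStemed.words)

# Look up the desired stem for one original word
print(stem.map[["inflation"]])
```

Looking up a few entries like this makes it easy to spot a pair that ended up out of order.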
Now create a function that replaces each original word with its desired stem, wrapped in a marker string, so that these words are not changed when stemDocument is applied:
```{r}
EXCLUSION_MARK <- "_EXCLUUUUUU"
markStemExclusion <- content_transformer(function(corpus) {
  for (i in seq_along(unStemed.words)) {
    # Replace each whole word with: marker + desired stem + "_"
    corpus <- gsub(paste0("\\b", unStemed.words[i], "\\b"),
                   paste0(EXCLUSION_MARK, stemed.words[i], "_"),
                   corpus)
  }
  return(corpus)
})
```
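To see what the marking step does to the text itself, here is a minimal base-R sketch that applies the same substitution to a plain character string instead of a corpus. The helper name `mark_words` and the sample sentence are made up for illustration, and `perl = TRUE` is my own choice to make the `\\b` word-boundary handling explicit.

```r
# Hypothetical helper: same substitution as markStemExclusion, on a plain string
mark_words <- function(text, from, to, mark = "_EXCLUUUUUU") {
  for (i in seq_along(from)) {
    # Whole-word match, replaced by: marker + desired stem + "_"
    text <- gsub(paste0("\\b", from[i], "\\b"),
                 paste0(mark, to[i], "_"),
                 text, perl = TRUE)
  }
  text
}

from <- c("anniversary", "inflation")
to   <- c("anniversary", "inflate")

marked <- mark_words("the anniversary of high inflation", from, to)
print(marked)
```

Words that are not in the list ("the", "of", "high") pass through untouched, so only the listed words carry the marker into the stemming step.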
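After marking, we are ready to apply stemDocument, and later remove the marker from the excluded words:
```{r}
unMarkStemExclusion <- content_transformer(function(corpus) {
  # Replace the marker prefix with a space; the trailing "_" is left
  # for the later punctuation cleanup step
  corpus <- gsub(EXCLUSION_MARK, " ", corpus)
  return(corpus)
})
```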
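The unmarking step can be sketched the same way on a plain string. This assumes the stemmer left the marked tokens untouched, and it uses a plain `gsub` on `_` as a stand-in for the later punctuation cleanup. The sample string is made up for illustration.

```r
EXCLUSION_MARK <- "_EXCLUUUUUU"

# Sample text as it might look after marking and stemming (made up)
stemmed <- "the _EXCLUUUUUUanniversary_ of high _EXCLUUUUUUinflate_"

# Step 1: replace the marker prefix with a space, as unMarkStemExclusion does
unmarked <- gsub(EXCLUSION_MARK, " ", stemmed, fixed = TRUE)

# Step 2: stand-in for the punctuation cleanup, turning the trailing "_" into a space
no_punct <- gsub("_", " ", unmarked, fixed = TRUE)

# Step 3: collapse repeated whitespace and trim both ends
cleaned <- trimws(gsub("\\s+", " ", no_punct))
print(cleaned)
```

The extra spaces introduced by steps 1 and 2 are why the whitespace cleanup has to run after unmarking rather than before it.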
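Finally, combining these steps in the cleaning pipeline gives the desired result:
```{r}
# replaceAbbreviations, cleanHtmlTags and replacePunctBySpace are not tm
# functions; they are custom transformers that are not shown here.
cleanData <- function(corpus, excludeStopWords = FALSE) {
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, replaceAbbreviations)
  # Mark the excluded words before stemming
  corpus <- tm_map(corpus, markStemExclusion)
  # Note: stop words are removed when excludeStopWords is FALSE
  if (excludeStopWords == FALSE) {
    corpus <- tm_map(corpus, removeWords,
                     c("said", "will", "next", stopwords("english")))
  }
  corpus <- tm_map(corpus, cleanHtmlTags)
  corpus <- tm_map(corpus, stemDocument)
  # Remove the marker once stemming is done
  corpus <- tm_map(corpus, unMarkStemExclusion)
  corpus <- tm_map(corpus, replacePunctBySpace)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
```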
Upvotes: 0