Reputation: 645
I am trying to do some text mining, using the tm package, on reviews that Italian users of a certain website wrote there. I scraped the texts, stored them in a corpus, and did some cleaning, but when I try to stem the words by removing the common endings, I have a problem specifying Italian as the language instead of the default, English.
reviews_corpus <- tm_map(reviews_corpus, removeNumbers)
reviews_corpus <- tm_map(reviews_corpus, removePunctuation)
reviews_corpus <- tm_map(reviews_corpus, stripWhitespace)
reviews_corpus <- tm_map(reviews_corpus, content_transformer(tolower))
reviews_corpus <- tm_map(reviews_corpus, removeWords, stopwords("italian"))
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
The first five lines work fine, but for the last one R gives me:
Error in UseMethod("stemDocument", x) :
no applicable method for 'stemDocument' applied to an object of class "c('VCorpus', 'Corpus')"
So my question is: how can I use stemDocument on a corpus while specifying the language I want it to use?
Upvotes: 1
Views: 330
Reputation: 23608
There is a bug in stemDocument: if you use any language other than English, it reverts to English. You can work around it by directly calling the word stemmer that stemDocument points to.
Instead of
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))
use
reviews_corpus <- tm_map(reviews_corpus, function(x) SnowballC::wordStem(x, language = "italian"))
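Two details may be worth keeping in mind with this workaround. First, tm_map expects a function as its second argument; the error in the question comes from calling stemDocument(reviews_corpus, ...) on the whole corpus before tm_map ever runs. Second, SnowballC::wordStem treats each element of its input as a single word, so a document usually has to be split into words first. The following is only a sketch under those assumptions: the sample texts and the helper name stem_italian are made up for illustration, and wrapping the helper in content_transformer is one way to keep the corpus elements as documents.

```r
library(tm)
library(SnowballC)

# Toy corpus standing in for the scraped reviews (hypothetical sample text).
reviews_corpus <- VCorpus(VectorSource(c(
  "ottimo ristorante camerieri gentilissimi",
  "pessima esperienza piatti freddi"
)))

# Hypothetical helper: split each string into words, stem every word with
# the Italian Snowball stemmer, then glue the stems back together.
stem_italian <- function(text) {
  vapply(strsplit(text, "[[:space:]]+"), function(words) {
    paste(SnowballC::wordStem(words, language = "italian"), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# content_transformer applies the helper to the text of each document
# and returns a document again, so the result is still a tm corpus.
stemmed_corpus <- tm_map(reviews_corpus, content_transformer(stem_italian))

# Inspect the stemmed text of the first review.
content(stemmed_corpus[[1]])
```

Splitting on whitespace is enough here because punctuation and numbers were already removed in the earlier cleaning steps.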
My advice, though, if you are working with a non-English language, is to use the quanteda package.
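For comparison, the same cleaning and stemming steps might look roughly like this in quanteda. This is a sketch, not a drop-in replacement: the sample reviews are invented, the Italian stopword list is taken from the stopwords package (installed alongside quanteda), and the order of the steps simply mirrors the tm pipeline in the question.

```r
library(quanteda)

# Hypothetical sample reviews standing in for the scraped texts.
reviews <- c(
  r1 = "Ottimo ristorante, i camerieri sono gentilissimi!",
  r2 = "Pessima esperienza: 2 piatti su 3 erano freddi."
)

# Tokenize, dropping punctuation and numbers along the way.
toks <- tokens(reviews, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Remove Italian stopwords, then stem with the Italian Snowball stemmer.
toks <- tokens_remove(toks, stopwords::stopwords("it"))
toks <- tokens_wordstem(toks, language = "italian")

# Build a document-feature matrix from the stemmed tokens.
reviews_dfm <- dfm(toks)
```

Here the language is passed explicitly to tokens_wordstem, so there is no fallback to English to work around.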
Upvotes: 3