abbassix

Reputation: 645

How to remove common word endings from a non-English corpus using the tm package?

I am trying to do some text mining with the tm package on reviews that Italian users of a certain website wrote. I scraped the texts, stored them in a corpus, and did some cleaning, but when I try to stem the words by removing their common endings, I have a problem specifying Italian instead of the default language, English.

reviews_corpus <- tm_map(reviews_corpus, removeNumbers)
reviews_corpus <- tm_map(reviews_corpus, removePunctuation)
reviews_corpus <- tm_map(reviews_corpus, stripWhitespace)
reviews_corpus <- tm_map(reviews_corpus, content_transformer(tolower))
reviews_corpus <- tm_map(reviews_corpus, removeWords, stopwords("italian"))
reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))

The first five lines work fine, but for the last one R gives me:

Error in UseMethod("stemDocument", x) : 
  no applicable method for 'stemDocument' applied to an object of class "c('VCorpus', 'Corpus')"

So my question is: how can I use stemDocument on a corpus while specifying the language I want?

Upvotes: 1

Views: 330

Answers (1)

phiver

Reputation: 23608

There is a bug in stemDocument: if you use any language other than English, it reverts to English. But there is a way around it: directly call the word stemmer that stemDocument points to.

Instead of

reviews_corpus <- tm_map(reviews_corpus, stemDocument(reviews_corpus, language="italian"))

use

stem_it <- function(x) paste(SnowballC::wordStem(unlist(strsplit(x, "\\s+")), language = "italian"), collapse = " ")
reviews_corpus <- tm_map(reviews_corpus, content_transformer(stem_it))
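
Two details may be worth spelling out here. First, the error in the question comes from calling stemDocument(reviews_corpus, ...) on the whole corpus inside tm_map, whereas tm_map expects a function as its second argument. Second, SnowballC::wordStem treats each element of its input as a single word, so a whole review needs to be split into words before stemming. Below is a minimal sketch of that splitting step, assuming SnowballC is installed; stem_italian is a hypothetical helper name, not part of tm or SnowballC, and the sample words are made up.

```r
library(SnowballC)

# Hypothetical helper (not part of tm or SnowballC):
# stem every word of each string in a character vector.
stem_italian <- function(text) {
  vapply(text, function(line) {
    # wordStem works word by word, so split the line on whitespace first
    words <- unlist(strsplit(line, "\\s+"))
    # stem each word with the Italian Snowball stemmer and glue them back
    paste(wordStem(words, language = "italian"), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample input: singular and plural forms of the same nouns
stem_italian(c("gatto gatti", "recensione recensioni"))
```

With tm, a helper like this would typically be wrapped before being handed to tm_map, for example tm_map(reviews_corpus, content_transformer(stem_italian)), so that each document stays a text document rather than becoming a bare character vector.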

But my advice, if you are working with a non-English language, is to use the quanteda package.
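
As a rough illustration of what the quanteda route could look like, here is a sketch that mirrors the cleaning steps from the question. It assumes the quanteda and stopwords packages are installed; the two sample reviews and the variable names (reviews, toks, review_dfm) are made up for the example, and the real review texts would take their place.

```r
library(quanteda)

# Made-up sample reviews standing in for the scraped texts
reviews <- c("I gatti dormono sul divano", "Il gatto dorme sempre")

# Tokenize, dropping punctuation and numbers along the way
toks <- tokens(reviews, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Remove Italian stopwords ("it" is the ISO code used by the stopwords package)
toks <- tokens_remove(toks, stopwords::stopwords("it"))

# Stem with the Italian Snowball stemmer
toks <- tokens_wordstem(toks, language = "italian")

# Build a document-feature matrix for further analysis
review_dfm <- dfm(toks)
```

Here the language is passed explicitly to tokens_wordstem, so the stemming step does not depend on any per-document language metadata.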

Upvotes: 3
