Reputation: 1
I'm processing some Indonesian texts in a Java application, and I need to stem them.
Currently I am using Lucene's Indonesian stemmer through org.apache.lucene.analysis.id.IndonesianAnalyzer, but the results are not satisfactory.
Could anyone suggest a different stemmer?
Upvotes: 0
Views: 832
Reputation: 33351
"enang" is a stem. Stems need not be actual words. For instance, in English, "argue", "argues", and "arguing" reduce to the stem "argu". "argu" isn't an English word, but it is a meaningful stem. This is how stemmers work. As long as you apply the stemmer the same way to the indexed data and the query, it should work well.
If you don't want behavior like that, it doesn't make any sense to use a stemmer at all.
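To make the point about consistency concrete, here is a minimal sketch in plain Java. The suffix stripper is a toy written for this illustration (it is not the Porter algorithm and not Lucene's Indonesian stemmer), and the class and method names are made up. It only shows that a non-word stem like "argu" still matches, provided the same stemming is applied at index time and at query time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: NOT the Porter algorithm and NOT Lucene's stemmer.
class ToyStemDemo {

    // Strip a few English suffixes so related forms collapse to one stem.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "es", "e", "s"}) {
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Build an inverted index keyed by stem rather than by surface form.
    static Map<String, List<Integer>> index(String[] docs) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].split("\\s+")) {
                List<Integer> ids = postings.computeIfAbsent(stem(token), k -> new ArrayList<>());
                if (!ids.contains(id)) {
                    ids.add(id);
                }
            }
        }
        return postings;
    }

    // Stem the query term the same way before looking it up.
    static List<Integer> search(Map<String, List<Integer>> postings, String term) {
        return postings.getOrDefault(stem(term), new ArrayList<>());
    }

    public static void main(String[] args) {
        String[] docs = {"they argue daily", "she argues well", "stop arguing"};
        Map<String, List<Integer>> postings = index(docs);
        System.out.println(stem("arguing"));           // argu
        System.out.println(search(postings, "argue")); // [0, 1, 2]
    }
}
```

The query "argue" finds all three documents even though "argu" never appears in any of them, because both sides were reduced to the same stem.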
Aside from the stemmer, IndonesianAnalyzer is fairly easily replicated. Its other components are just a StandardTokenizer, StandardFilter, LowerCaseFilter, and a StopFilter. That's just a StandardAnalyzer with an Indonesian stopword set, when you get right down to it, so you can create an IndonesianAnalyzer without the stemmer as simply as:
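// If you are using the default stopword location defined in IndonesianAnalyzer, you could load the stopwords like this.
// Note: loadStopwordSet is a protected static method, so this call only compiles inside a StopwordAnalyzerBase subclass; elsewhere, IndonesianAnalyzer.getDefaultStopSet() should give you the same default set.
CharArraySet defaultStopSet = StopwordAnalyzerBase.loadStopwordSet(false, IndonesianAnalyzer.class, IndonesianAnalyzer.DEFAULT_STOPWORD_FILE, "#");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, defaultStopSet);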
I'm not sure whether you would run into problems just passing a reader on the default stop word file into the StandardAnalyzer constructor.
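If it helps to see what a stemmer-free chain does to the text, here is a rough plain-Java stand-in for the tokenize, lowercase, and stopword-removal steps. It does not use the Lucene API at all; the class name, the crude tokenizer, and the three-word stopword sample are assumptions made for this sketch (Lucene's own Indonesian list is much longer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-in for a tokenizer + lowercase + stop filter chain.
// This is an illustration, not Lucene code.
class NoStemPipelineDemo {

    // Tiny hand-picked sample of Indonesian function words (assumption for this sketch).
    static final Set<String> SAMPLE_STOPWORDS =
            new HashSet<>(Arrays.asList("yang", "dan", "di"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude tokenizer: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String lower = raw.toLowerCase(Locale.ROOT);
            // Drop stopwords; everything else is kept as a whole word, unstemmed.
            if (!SAMPLE_STOPWORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [saya, senang, tenang, rumah]
        System.out.println(analyze("Saya senang dan tenang di rumah"));
    }
}
```

The words come out lowercased and whole, with only the stopwords removed, which is the behavior you get from the analyzer once the stem filter is left out.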
Upvotes: 2