Kobe-Wan Kenobi

Reputation: 3864

StandardAnalyzer with stemming

Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene? StandardAnalyzer is declared as a final class, so I cannot subclass it. Do I have to copy and paste StandardAnalyzer's source code and add the filter myself, or is there a smarter way?

Also, if I want the analyzer to ignore numbers, how can I achieve that?

Thanks

Upvotes: 4

Views: 3814

Answers (1)

aalbahem

Reputation: 782

If you want this combination for English text analysis, use Lucene's EnglishAnalyzer, which already applies PorterStemFilter on top of StandardTokenizer. Otherwise, you can create a new Analyzer that extends AnalyzerWrapper, as shown below.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;


public class PorterAnalyzer extends AnalyzerWrapper {

  private Analyzer baseAnalyzer;

  public PorterAnalyzer(Analyzer baseAnalyzer) {
      this.baseAnalyzer = baseAnalyzer;
  }

  @Override
  public void close() {
      baseAnalyzer.close();
      super.close();
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName)
  {
      return baseAnalyzer;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components)
  {
      TokenStream ts = components.getTokenStream();
      Set<String> filteredTypes = new HashSet<>();
      filteredTypes.add("<NUM>");
      TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46,ts, filteredTypes);

      PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
      return new TokenStreamComponents(components.getTokenizer(), porterStem);
  }

  public static void main(String[] args) throws IOException
  {

      //Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
      PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
      String text = "This is a testing example. It should tests the Porter stemmer version 111";

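      TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
      // Register the term attribute once, before consuming the stream
      CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);
      ts.reset();

      while (ts.incrementToken()){
          System.out.println(ca.toString());
      }
      // Finish and release the stream before closing the analyzer
      ts.end();
      ts.close();
      analyzer.close();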
  }

}

The code above is based on this Lucene forum thread. The main work is done in the wrapComponents method. You first get the TokenStream object from the wrapped analyzer, then apply a TypeTokenFilter to drop numeric tokens (those of type "<NUM>"), and finally apply the Porter stem filter. I hope that is clear.
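To make the order of the stages concrete without depending on a Lucene jar, here is a minimal plain-Java sketch of the same pipeline: tokenize, drop tokens typed "<NUM>", then stem. Everything in it is hypothetical and for illustration only (the FilterChainSketch class, the Token class, and especially toyStem, which is a crude suffix stripper and not the Porter algorithm). It also skips the stop-word removal that StandardAnalyzer performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FilterChainSketch {

    // Hypothetical token: a term plus a type label, loosely modeled on the
    // "<NUM>" / "<ALPHANUM>" types assigned by Lucene's StandardTokenizer.
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    // Stage 1: split into lowercase tokens and tag all-digit tokens as "<NUM>".
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (part.isEmpty()) continue; // split() can yield a leading empty string
            String type = part.matches("[0-9]+") ? "<NUM>" : "<ALPHANUM>";
            tokens.add(new Token(part, type));
        }
        return tokens;
    }

    // Stage 2: drop every token of the unwanted type (the role TypeTokenFilter plays above).
    static List<Token> dropType(List<Token> in, String unwantedType) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!t.type.equals(unwantedType)) out.add(t);
        }
        return out;
    }

    // Stage 3: toy stemmer, an assumption for illustration. NOT the Porter algorithm.
    static String toyStem(String term) {
        if (term.endsWith("ing") && term.length() > 5) {
            return term.substring(0, term.length() - 3);
        }
        if (term.endsWith("s") && !term.endsWith("ss") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // The chain: tokenizer -> type filter -> stemmer, same order as wrapComponents.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (Token t : dropType(tokenize(text), "<NUM>")) {
            out.add(toyStem(t.term));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Testing tests version 111")); // prints [test, test, version]
    }
}
```

The type filter runs before the stemmer, so the stemmer never sees the numeric tokens. In the real analyzer, PorterStemFilter and TypeTokenFilter take the place of toyStem and dropType.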

Upvotes: 3
