Person1

Reputation: 67

Extending Lucene Analyzer

I have special analyzing needs in Lucene, but I want to keep using parts of the StandardAnalyzer mechanism.

In particular, I want the string

"-apple--carrot- tomato?"

to be tokenized into:

  1. "-apple-"
  2. "-carrot-"
  3. "tomato"

(strings surrounded by dashes are treated as a separate token)

It seems that to achieve this, I have to customize the analyzer and the tokenizer. But do I have to rewrite them from scratch? For example, I don't want to have to tell the tokenizer (or token filter) that it should omit the question mark in "apple?".

Is there a way to just modify an existing analyzer?

Upvotes: 0

Views: 955

Answers (1)

Mysterion

Reputation: 9320

Basically, you can't extend StandardAnalyzer, since it is a final class. But you can get the same effect by writing your own analyzer around your own tokenizer, and it's simple. Modifying the existing class in place is also not an option, since patching the library source is a bad idea.

I could imagine something like this:

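// Imports assume Lucene 7.x; package names differ in other versions.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;

public class CustomAnalyzer extends Analyzer {

    // stop words to drop; empty here, fill in your own set if needed
    private final CharArraySet stopwords = CharArraySet.EMPTY_SET;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // provide your own tokenizer, that will split input string as you want it
        final Tokenizer standardTokenizer = new MyStandardTokenizer();

        // StandardFilter was removed in Lucene 8.0; drop this wrapper there
        TokenStream tok = new StandardFilter(standardTokenizer);
        // make everything lowercase, remove if not needed
        tok = new LowerCaseFilter(tok);
        // provide stop words if you want them
        tok = new StopFilter(tok, stopwords);
        return new TokenStreamComponents(standardTokenizer, tok);
    }

    // final, because Lucene asserts that TokenStream implementations are final
    private final class MyStandardTokenizer extends Tokenizer {

        @Override
        public boolean incrementToken() throws IOException {
            // mimic the logic of the standard tokenizer and add your rules
            return false; // placeholder: emits no tokens yet
        }
    }
}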

I put everything into one class just to make it easier to post here. In general, you need your own logic in MyStandardTokenizer: e.g. you could copy the code from StandardTokenizer (it is final as well, so extending it is not an option either) and then add the handling for your dash-delimited strings in incrementToken. Hope this helps.
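To make the dash rule from the question a bit more concrete, here is a rough plain-Java sketch of just the splitting logic, with no Lucene dependency. The class name DashTokenSketch, the method tokenize and the regular expression are my own assumptions for illustration. They are not part of Lucene, and the regex does not reproduce StandardTokenizer's Unicode segmentation rules: it only keeps "-word-" runs together and otherwise emits runs of letters and digits, dropping punctuation such as the question mark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: it only demonstrates the
// splitting rule that a custom incrementToken() would have to implement.
class DashTokenSketch {

    // First alternative: a dash, one or more non-dash/non-space chars, a dash.
    // Second alternative: a plain run of letters and digits.
    private static final Pattern TOKEN =
            Pattern.compile("-[^-\\s]+-|[\\p{L}\\p{N}]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() scans left to right, so "-apple--carrot-" yields two tokens
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [-apple-, -carrot-, tomato]
        System.out.println(tokenize("-apple--carrot- tomato?"));
    }
}
```

In a real tokenizer, each match would presumably become one token emitted per incrementToken() call, with the term text and offsets copied into the token attributes, instead of being collected into a list.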

Upvotes: 3
