Filip

Reputation: 1244

Lucene bigrams tokenizer to include punctuation signs

Is there a way to use Lucene's ShingleAnalyzerWrapper to generate bigrams that respect punctuation marks (e.g. '.', ',', ';')? Quick example: the field "one two; three four" should yield only two bigrams, (one two) and (three four), with no bigram spanning the semicolon.

Upvotes: 2

Views: 477

Answers (1)

gavans

Reputation: 46

You could create a ShingleAnalyzerWrapper that wraps an analyzer based on LetterTokenizer. LetterTokenizer splits the input text at every non-letter character, so punctuation never ends up inside a token. Something like:

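// Assumes a Lucene 2.9/3.0-era API, where LetterTokenizer takes only a Reader.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

public class MyCharAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // LetterTokenizer emits maximal runs of letters and drops everything else.
    return new LetterTokenizer(reader);
  }
}

// The wrapper's default maximum shingle size is 2, i.e. bigrams.
ShingleAnalyzerWrapper myBigramWrapper = new ShingleAnalyzerWrapper(new MyCharAnalyzer());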

If you want finer control over what counts as punctuation, you can subclass CharTokenizer and override its isTokenChar() method.
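To make the target behavior from the question concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It borrows the idea of an isTokenChar()-style predicate to decide what belongs to a token, and additionally resets the bigram window whenever a punctuation character is seen. One thing worth checking in a real Lucene setup: a tokenizer that simply drops punctuation leaves no trace of where it was, so a shingle filter running over that stream may still pair "two" with "three". The class name PunctuationAwareBigrams, its methods, and the punctuation set are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationAwareBigrams {

    // Hypothetical punctuation set: characters that must not be spanned by a bigram.
    static boolean isPunctuation(int c) {
        return c == '.' || c == ',' || c == ';';
    }

    // Same idea as a CharTokenizer-style predicate: only letters belong to tokens.
    static boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    // Returns bigrams of adjacent tokens, never crossing a punctuation character.
    public static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null;
        // Iterate one position past the end so the final token gets flushed.
        for (int i = 0; i <= text.length(); i++) {
            int c = i < text.length() ? text.charAt(i) : -1;
            if (c != -1 && isTokenChar(c)) {
                current.append((char) c);
                continue;
            }
            // A non-token character (or end of input) closes the current token.
            if (current.length() > 0) {
                String token = current.toString();
                if (previous != null) {
                    result.add(previous + " " + token);
                }
                previous = token;
                current.setLength(0);
            }
            // Punctuation resets the window, so the next token starts a new bigram.
            if (c != -1 && isPunctuation(c)) {
                previous = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("one two; three four")); // prints [one two, three four]
    }
}
```

Whitespace only separates tokens here, while '.', ',' and ';' also break the bigram chain; changing isPunctuation() is enough to adjust which characters act as boundaries.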

Upvotes: 1
