Lucene: index multiple tokens for each word in the text

Question

I'm using Lucene 3.5 with SpanishAnalyzer (which itself uses SpanishStemmer and StandardTokenizer).
When SpanishAnalyzer indexes a document containing, for example, the words "claramente" and "claro", both are indexed as "clar".
This behavior is understood and suits my needs. Today, before querying, I use the Analyzer's tokenStream + incrementToken() to get the token for my search term and search for that in the indexed documents. I'm not using QueryParser; I build Lucene query objects in code.
However, I want to be able to search for the exact word (in this example, "claro") without losing the morphological abilities of the SpanishAnalyzer.
I can skip the step above (tokenStream) and search for "claro" directly, but it will not be found because it is indexed as "clar".
I also do not want to index the field twice with two different analyzers, because I need to be able to use a PhraseQuery or SpanNearQuery containing one exact word and one regular (morphological) term.
So, to get to the point: I thought of modifying the Tokenizer, Stemmer, or Filter (?) so that at indexing time it emits two tokens for each word, the stemmed one and the original one (in this case "clar" and "claro"). Later, when querying, I can choose whether to use the exact word or the stemmed token.
I need help understanding how (and where) I can do that. I guess the change should be made somewhere in the Stemmer.

By the way, I do exactly the same with a Hebrew Analyzer that returns several tokens for each word in the text when using incrementToken() (but I don't have its source code).

Karsten R. · Accepted Answer

You need an index with multiple tokens per position, because you want to search phrases that mix stemmed tokens and non-stemmed (= original) tokens. I will answer for version 5.3, but 3.5 is not very different.

Take a look at the source code of ReversedWildcardFilter in Solr. You will see two steps for each token:

Store the current state with the original token. The first call to incrementToken() then returns the stemmed token, and the second call returns the original token (at the same position).
Choose a "markerChar" as a prefix for the stemmed token, so that at search time you can decide whether to search with the stemmed or the original token.
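
The two steps can be modelled without any Lucene classes. The sketch below is only an illustration of the idea, not Lucene or Solr code: `DualTokenModel`, `Token`, and `toyStem` are hypothetical names, the toy stemmer is a stand-in for the real Spanish stemmer, and the marker `\u0001` is an assumed choice (any character the tokenizer never emits would do). A position increment of 0 means "same position as the previous token".

```java
import java.util.ArrayList;
import java.util.List;

// Library-free model of "two tokens per position"; no Lucene classes are used.
class DualTokenModel {
    // Assumed marker character that prefixes every stemmed token.
    static final char MARKER = '\u0001';

    // One emitted token: the term text plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Hypothetical toy stemmer, NOT the Lucene Spanish stemmer:
    // it strips the first matching suffix from a short, fixed list.
    static String toyStem(String word) {
        for (String suffix : new String[] {"amente", "o", "a"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // For each word: first the marked stem (increment 1),
    // then the original word at the same position (increment 0).
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Token(MARKER + toyStem(word), 1));
            out.add(new Token(word, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : analyze("claramente claro")) {
            // Print the marker as '#' so it is visible on a terminal.
            System.out.println(t.term.replace(MARKER, '#') + " +" + t.positionIncrement);
        }
    }
}
```

In a real TokenFilter the "remember the original" part would be done by capturing and restoring the attribute state rather than by building a list, but the emitted sequence of terms and position increments has the same shape.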

For your SpanishAnalyzer, this would mean, for example, the following:

The core of SpanishAnalyzer is the SpanishLightStemFilter, which only stems tokens for which KeywordAttribute.isKeyword() is false. So at index time, insert a KeywordRepeatFilter into the SpanishAnalyzer chain before the stemmer, and mark the stemmed token with a prefix.
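
At search time, the choice between "exact" and "stemmed" then comes down to which term string you put into the query. The following is a library-free sketch of why that works for phrases; it is a model, not Lucene code. `MixedPhraseModel` and its methods are hypothetical, the postings are written out by hand, the marker `\u0001` is an assumption, and the stem "muy" for "muy" is assumed for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Library-free model of phrase matching over an index that holds
// a marked stem and the original word at every position.
class MixedPhraseModel {
    // Assumed marker character that prefixes stemmed terms.
    static final char MARKER = '\u0001';

    // Query-side helpers: an exact term is used as-is, a stemmed term gets the marker.
    static String exact(String word) {
        return word;
    }

    static String stemmed(String stem) {
        return MARKER + stem;
    }

    // Build term -> positions from the tokens present at each position.
    static Map<String, Set<Integer>> index(String[][] tokensPerPosition) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokensPerPosition.length; pos++) {
            for (String term : tokensPerPosition[pos]) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(pos);
            }
        }
        return postings;
    }

    // True if the terms occur at consecutive positions; expects a non-empty term list.
    static boolean matchesPhrase(Map<String, Set<Integer>> postings, List<String> terms) {
        Set<Integer> none = Collections.emptySet();
        for (int start : postings.getOrDefault(terms.get(0), none)) {
            boolean ok = true;
            for (int i = 1; i < terms.size() && ok; i++) {
                ok = postings.getOrDefault(terms.get(i), none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Hand-written postings for the text "muy claramente".
    static Map<String, Set<Integer>> docClaramente() {
        return index(new String[][] {
            {stemmed("muy"), "muy"},
            {stemmed("clar"), "claramente"}
        });
    }

    // Hand-written postings for the text "muy claro".
    static Map<String, Set<Integer>> docClaro() {
        return index(new String[][] {
            {stemmed("muy"), "muy"},
            {stemmed("clar"), "claro"}
        });
    }

    public static void main(String[] args) {
        List<String> mixed = Arrays.asList(stemmed("muy"), exact("claro"));
        System.out.println(matchesPhrase(docClaro(), mixed));
        System.out.println(matchesPhrase(docClaramente(), mixed));
    }
}
```

Because both variants of a word share one position, a phrase can freely combine a marked stem with an exact word. In Lucene itself, the corresponding query would presumably be a PhraseQuery or SpanNearQuery on the single field, with each term being either the prefixed stem or the original word.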
