Reputation: 109
I'm using Lucene 3.5 with SpanishAnalyzer (which itself uses SpanishStemmer and StandardTokenizer).
When SpanishAnalyzer indexes a document containing, for example, the words "claramente" and "claro", both are indexed as "clar".
This behavior is understood and suits my needs. Today, before querying, I call the Analyzer's tokenStream() and incrementToken() to get the token for my search term, and then search for that token in the index. I'm not using QueryParser; I build the Lucene query objects in code.
However, I want to be able to search for the exact word (in this example "claro") without losing the morphological abilities of SpanishAnalyzer.
I can skip the step above (tokenStream) and search for "claro" directly, but it will not be found because it is indexed as "clar".
Also, I do not want to index the field twice with two different analyzers, because I need to be able to use a PhraseQuery or SpanNearQuery containing one exact word and one regular (morphological) term.
So, to get to the point: I thought of modifying the Tokenizer, Stemmer, or Filter (?) so that at indexing time it emits two tokens for each word, the stemmed one and the original one (in this case "clar" and "claro"). Later, when querying, I can choose whether to use the exact word or the stemmed token.
I need help understanding how (and where) I can do that. I guess the change belongs somewhere in the Stemmer.
By the way, I do exactly the same with a Hebrew Analyzer that returns several tokens for each word in the text when calling incrementToken() (but I don't have its source code).
Upvotes: 2
Views: 1421
Reputation: 1758
You need an index with multiple tokens per position, because you want to search phrases that mix stemmed tokens and non-stemmed (= original) tokens. I will answer for version 5.3, but 3.5 was not very different.
Take a look at the source code of ReversedWildcardFilter in Solr. You will see the two steps on each token:
In the case of your SpanishAnalyzer this would mean e.g. the following:
The core of SpanishAnalyzer is the SpanishLightStemFilter. The SpanishLightStemFilter only stems tokens for which KeywordAttribute.isKeyword() is false. So at index time, insert a KeywordRepeatFilter into the SpanishAnalyzer chain before the stem filter, and mark the stemmed tokens with a prefix.
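To make the idea concrete, here is a minimal plain-Java sketch of the token stream such a chain would produce. It does not use any Lucene classes: `toyStem` is a hypothetical stand-in for the real stemmer (it only handles the example words), and the `#` prefix is an arbitrary marker chosen for illustration. A position increment of 0 means "same position as the previous token", which is what keeps phrase queries working.

```java
import java.util.ArrayList;
import java.util.List;

final class DualTokenSketch {

    /** One emitted token: its text and its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Hypothetical marker for stemmed terms; assumed never to occur in tokenizer output. */
    static final String STEM_PREFIX = "#";

    /** Toy stand-in for a real Spanish stemmer; only strips the suffixes needed for the example. */
    static String toyStem(String word) {
        if (word.endsWith("amente")) {
            return word.substring(0, word.length() - "amente".length());
        }
        if (word.endsWith("o")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Emits the original word, then its prefixed stem at the same position. */
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new Token(word, 1));                        // exact form, advances the position
            out.add(new Token(STEM_PREFIX + toyStem(word), 0)); // stem, stacked on the same position
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("claramente claro"));
    }
}
```

With this layout, an exact search would use the bare term ("claro") and a morphological search would use the prefixed stem ("#clar"), both against the same field.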
Upvotes: 3
Reputation: 33341
There is a token filter which enables this pretty easily, the KeywordRepeatFilter (SpanishLightStemFilter does respect the KeywordAttribute). Simply add it to your analysis chain just before the stemmer. For SpanishAnalyzer, the createComponents method would look like this:
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source;
        if (getVersion().onOrAfter(Version.LUCENE_4_7_0)) {
            source = new StandardTokenizer();
        } else {
            source = new StandardTokenizer40();
        }
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);
        if (!stemExclusionSet.isEmpty())
            result = new SetKeywordMarkerFilter(result, stemExclusionSet);
        // Emit each token twice: once marked as a keyword (left unstemmed), once unmarked (stemmed)
        result = new KeywordRepeatFilter(result);
        result = new SpanishLightStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
This won't allow you to explicitly search only unstemmed terms, but it will keep the original terms at the same positions as the stems, allowing them to be factored into phrase queries easily. If you do need to explicitly search only stemmed, or only unstemmed, terms, then indexing in separate fields would really be the better approach.
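As a rough illustration of why same-position tokens matter for phrase queries, the following plain-Java sketch models an index as a list of term sets, one set per position, and checks a phrase by looking for consecutive positions that contain the requested terms. It is a simulation only, not Lucene code; `toyStem` is a hypothetical stand-in for the stemmer and handles just the example words.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SamePositionPhraseSketch {

    /** Toy stand-in for a real Spanish stemmer; only strips the suffixes needed for the example. */
    static String toyStem(String word) {
        if (word.endsWith("amente")) {
            return word.substring(0, word.length() - "amente".length());
        }
        if (word.endsWith("o") || word.endsWith("a")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Builds one term set per position, holding both the original word and its stem. */
    static List<Set<String>> index(String text) {
        List<Set<String>> positions = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> terms = new HashSet<>();
            terms.add(word);          // original form
            terms.add(toyStem(word)); // stem at the same position
            positions.add(terms);
        }
        return positions;
    }

    /** Returns true if the terms occur at consecutive positions, in order. */
    static boolean matchesPhrase(List<Set<String>> positions, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        for (int start = 0; start + terms.length <= positions.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                ok = positions.get(start + i).contains(terms[i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> idx = index("habla claramente");
        System.out.println(matchesPhrase(idx, "habla", "clar"));  // exact word followed by a stem
        System.out.println(matchesPhrase(idx, "habla", "claro")); // exact "claro" is not in the text
    }
}
```

Because the stems carry no marker in this layout, a query term such as "clar" cannot tell a stem apart from an original word that happens to be spelled the same way, which is the limitation described above.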
Upvotes: 0