Reputation: 8670
I need to configure Solr for word-level n-grams (unigrams, bigrams and trigrams). For example, if the input (at index or query time) is:
"Welcome to Apache Solr", it should be tokenized as:
Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"
How can I get this from Solr? I have consulted the default Solr guide, but I have not found a word-level n-gram tokenizer.
Upvotes: 1
Views: 813
Reputation: 8658
You can use the Shingle Filter here.
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ShingleFilterFactory"/>
</analyzer>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
You can also use the following property:
maxShingleSize: (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.
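As a rough illustration of this property, the analyzer below raises maxShingleSize to 3. The minShingleSize and outputUnigrams attributes are written out with their default values only to make them visible; they are not required. With the input "To be, or what?" this should additionally emit the three-word shingles "To be or" and "be or what".

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- minShingleSize="2" and outputUnigrams="true" are the defaults, shown here for clarity -->
  <!-- maxShingleSize="3" allows shingles of up to three tokens -->
  <filter class="solr.ShingleFilterFactory"
          minShingleSize="2"
          maxShingleSize="3"
          outputUnigrams="true"/>
</analyzer>
```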
I tried this with the text from your question.
Here is the field type I applied:
<fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>
</fieldType>
The expected output is:
Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"
The output after applying the above field type covers all the expected tokens:
Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"
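Note that with maxShingleSize="4" the filter also emits the four-word shingle "Welcome to Apache Solr" for this input. If you want to stop at trigrams, a variant along the following lines should work. This is a sketch rather than a tested configuration, and the field type name text_shingles is only a placeholder.

```xml
<!-- "text_shingles" is a hypothetical name; use whatever fits your schema -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- maxShingleSize="3" caps shingles at trigrams; outputUnigrams keeps the single words -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element is used, the same chain applies at both index and query time.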
For more details, please refer to the link below: Shingle Filter Example
Upvotes: 2