Hafiz Muhammad Shafiq

Reputation: 8670

Apache Solr word level ngram

I need to configure Solr for word-level n-grams (unigrams, bigrams and trigrams). For example, if the input (at index or query time) is:

"Welcome to Apache Solr"

it should be tokenized as:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

How can I get this from Solr? I have consulted the default Solr guide, but I could not find a word-level n-gram tokenizer.

Upvotes: 1

Views: 813

Answers (1)

Abhijit Bashetti

Reputation: 8658

You can use the Shingle Filter here.

This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ShingleFilterFactory"/>
</analyzer>

In: "To be, or what?"

Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)

You can also use the following property:

maxShingleSize : (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.
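For the unigram-to-trigram case in the question, a minimal sketch of an analyzer with both size bounds set explicitly might look like the following. The attribute values are assumptions chosen to match the question; `minShingleSize`, `maxShingleSize` and `outputUnigrams` are options of `solr.ShingleFilterFactory`.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Emit shingles of 2 to 3 tokens; outputUnigrams="true" keeps the single words too -->
  <filter class="solr.ShingleFilterFactory"
          minShingleSize="2" maxShingleSize="3" outputUnigrams="true"/>
</analyzer>
```

Here the unigrams come from `outputUnigrams`, not from `minShingleSize`, which cannot be set below 2.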

I tried it with the text from your question.

Here is the field type I applied:

<fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>
</fieldType>

The expected output is:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

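The output after applying the above field type, as shown on the Solr Analysis page, covers all the expected tokens:

Unigram: Welcome, to, Apache, Solr
Bigram: Welcome to, to Apache, Apache Solr
Trigram: Welcome to Apache, to Apache Solr

Note that with maxShingleSize="4" the filter also emits the four-word shingle "Welcome to Apache Solr". Set maxShingleSize="3" if you want to stop at trigrams.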

For more details, please refer to this link: Shingle Filter Example

Upvotes: 2
