Apache Solr word level ngram

Question

I need to configure Solr to produce word-level n-grams (unigrams, bigrams and trigrams). For example, suppose the input (at index or query time) is:

"Welcome to Apache Solr"

It should be tokenized as:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

How can I get this from Solr? I have consulted the Solr reference guide, but I have not found a word-level n-gram tokenizer.

Abhijit Bashetti · Accepted Answer

You can use the Shingle Filter here.

This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

In: "To be, or what?"

Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
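For reference, the default behaviour in the example above corresponds to an analyzer along these lines. This is a minimal sketch, assuming the standard tokenizer and no extra filter parameters; it goes inside a fieldType definition in your schema.

```xml
<!-- Sketch: shingle filter with default settings (shingles of up to 2 tokens, unigrams kept) -->
<analyzer>
  <!-- Assumption: standard tokenizer, which splits on whitespace and drops punctuation -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Defaults: minShingleSize=2, maxShingleSize=2, outputUnigrams=true -->
  <filter class="solr.ShingleFilterFactory"/>
</analyzer>
```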

By default only unigrams and bigrams are produced. To get trigrams as well, set the following parameter to 3:

maxShingleSize : (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.

I tried this with the text from your question.

Here is the field type I applied.
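A field type along the following lines should produce unigrams, bigrams and trigrams. Treat it as a sketch rather than the exact definition used in this answer: the name `text_shingle` is hypothetical, and the choice of the whitespace tokenizer is an assumption.

```xml
<!-- Sketch: word-level 1-, 2- and 3-grams; the field type name is hypothetical -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: split on whitespace only, so case and punctuation are preserved -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- minShingleSize=2 and maxShingleSize=3 emit bigrams and trigrams;
         outputUnigrams=true keeps the single words in the stream too -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element is used, the same chain applies at both index and query time. You can check the resulting tokens on the Analysis screen of the Solr Admin UI.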

The expected output is:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

The output produced by the above field type is:

It covers all the expected tokens:

Unigram: Welcome, to, Apache, Solr
Bigram: Welcome to, to Apache, Apache Solr
Trigram: Welcome to Apache, to Apache Solr

For more details, please refer to the link below: Shingle Filter Example
