D_K

Reputation: 1420

Autocomplete via shingles and termvector component

One way to build Google-like auto-completion is to combine shingles and the termvector component in Solr 1.4.

First we generate all n-grams with the shingle filter, and then use termvector to find the closest prediction for the user's term sequence (ranked by document frequency).
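To make the ranking idea concrete, here is a minimal Python sketch that runs outside Solr. It is not Solr's API: the function name `suggest` and the document-frequency numbers are made up for illustration. It only shows the lookup described above, which is to find indexed shingles that start with what the user typed and order them by document frequency.

```python
def suggest(prefix, doc_freq, limit=5):
    """Return up to `limit` indexed shingles starting with `prefix`,
    most frequent first (ties broken alphabetically)."""
    prefix = prefix.lower()
    matches = [(term, df) for term, df in doc_freq.items()
               if term.startswith(prefix)]
    # Highest document frequency first, then alphabetical order.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in matches[:limit]]


# Hypothetical shingle -> document frequency table (invented numbers).
doc_freq = {
    "india and china": 12,
    "india and pakistan": 30,
    "indian ocean": 7,
    "china trade": 4,
}

print(suggest("india and", doc_freq))
```

In Solr itself the equivalent lookup would be done by the search component against the shingle field; the sketch only mirrors the logic.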

Schema:

<fieldType name="shingle_text_fivegram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Solr config:

<searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
        <bool name="terms">true</bool>
        <str name="terms.fl">shingleContent_fivegram</str>
    </lst>
    <arr name="components">
        <str>termsComponent</str>
    </arr>
</requestHandler>

With the above setup, I need to drop stopwords at the edges of an n-gram but keep them when they occur inside the n-gram sequence.

For example, from the sequence "india and china" I need the following n-grams:

india
china
india and china

and skip the rest.
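To pin down the requirement, here is a small Python sketch of the behaviour I am after. It is plain Python rather than a Solr or Lucene filter, and the function name `edge_clean_shingles` and the whitespace tokenization are assumptions made for illustration only. It builds every n-gram up to a maximum size and discards any n-gram whose first or last token is a stopword, so stopwords survive only in the interior.

```python
def edge_clean_shingles(text, stopwords, max_size=5):
    """Build all n-grams (1..max_size tokens) and keep only those that
    neither start nor end with a stopword."""
    tokens = text.lower().split()
    kept = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            gram = tokens[start:start + size]
            # Reject n-grams with a stopword on either edge;
            # stopwords in the middle are left untouched.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            kept.append(" ".join(gram))
    return kept


print(edge_clean_shingles("india and china", {"and"}))
# prints ['india', 'china', 'india and china']
```

Here "and", "india and", and "and china" are dropped because a stopword sits on an edge, while "india and china" is kept.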

Is it doable in combination with other Solr components/filters?

UPD: here is one possible solution in Lucene 4 (it should be possible to wire it into Solr):

"Couldn't you make a custom stop filter that only removed stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (captureState/restoreState) but it seems doable?" -- Michael McCandless

from: http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html

Upvotes: 3

Views: 2139

Answers (2)

James Doepp - pihentagyu

Reputation: 1308

Use a separate query analyzer with KeywordTokenizerFactory, like this (using your example):

<analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Upvotes: 1

The best way to do multi-word auto-complete in Solr 1.4 is with EdgeNGramFilterFactory, because you need to match the user's input as it is typed. That means "i", "in", "ind", and so on should all match in order to suggest "India".
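As a rough illustration of what edge n-grams give you, here is a short Python sketch. It is not Solr's implementation: the function name `edge_ngrams` and its default sizes are assumptions chosen for the example. It emits every leading prefix of a term, which is what lets a partially typed word match the indexed term.

```python
def edge_ngrams(term, min_size=1, max_size=25):
    """Return the leading prefixes of `term`, from `min_size` up to
    `max_size` characters (capped at the term's length)."""
    term = term.lower()
    upper = min(len(term), max_size)
    return [term[:n] for n in range(min_size, upper + 1)]


print(edge_ngrams("India"))
# prints ['i', 'in', 'ind', 'indi', 'india']
```

If these prefixes are indexed for each term, a query for "ind" matches the term they were generated from.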

Upvotes: 1
