Reputation: 1420
One way to get Google-like auto-completion in Solr 1.4 is to combine shingles with the terms component (TermsComponent).
First, the shingle filter generates all n-grams at index time; then the terms component returns the indexed terms that best match the sequence the user has typed so far, ranked by document frequency.
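As a rough illustration of that idea outside Solr, here is a minimal Python sketch. It assumes whitespace tokenization and ignores the stop filter; the function names (`shingles`, `document_frequencies`, `complete`) are hypothetical and only mimic what the shingle filter and the terms component do together.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=5):
    """Yield word n-grams of min_size..max_size tokens (no unigrams)."""
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def document_frequencies(docs, max_size=5):
    """Count in how many documents each shingle occurs."""
    df = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # set() so a shingle counts once per document
        df.update(set(shingles(tokens, max_size=max_size)))
    return df


def complete(prefix, df, limit=3):
    """Return shingles starting with prefix, highest document frequency first."""
    prefix = prefix.lower()
    matches = [(term, count) for term, count in df.items() if term.startswith(prefix)]
    matches.sort(key=lambda tc: (-tc[1], tc[0]))
    return [term for term, _ in matches[:limit]]


docs = ["india and china", "india and pakistan", "trade with india and china"]
df = document_frequencies(docs)
print(complete("india", df, limit=2))
```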
Schema:
<fieldType name="shingle_text_fivegram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Solr config:
<searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <str name="terms.fl">shingleContent_fivegram</str>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>
With the above setup, I need to drop stopwords at the edges of each n-gram but keep them when they occur inside the n-gram.
For example, from the sequence "india and china" I need only the following terms:
india
china
india and china
and skip the rest.
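To make the requirement concrete, here is a minimal Python sketch of the desired behavior. It is not Solr code: the function name `edge_trimmed_shingles` and the tiny stopword set are assumptions for illustration. Unlike the schema above, it also emits unigrams, since the expected output includes "india" and "china".

```python
def edge_trimmed_shingles(tokens, stopwords, max_size=5):
    """Return n-grams (1..max_size tokens) that neither start nor end with a stopword.

    Stopwords inside an n-gram are kept.
    """
    results = []
    seen = set()
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            gram = tokens[start:start + size]
            # Skip any n-gram with a stopword on either edge.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            text = " ".join(gram)
            if text not in seen:  # drop duplicates, keep first-seen order
                seen.add(text)
                results.append(text)
    return results


STOPWORDS = {"and", "the", "of"}  # illustrative stand-in for stopwords.txt
print(edge_trimmed_shingles("india and china".split(), STOPWORDS))
```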
Is this doable by combining other Solr components or filters?
Update: here is one possible solution in Lucene 4 (it should be possible to wire it into Solr):
"Couldn't you make a custom stop filter that only removed stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (captureState/restoreState) but it seems doable?" -- Michael McCandless
Source: http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html
Upvotes: 3
Views: 2139
Reputation: 1308
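Use a separate query-time analyzer based on KeywordTokenizerFactory. The keyword tokenizer keeps the whole query as a single token, so the stop filter cannot strip individual stopwords out of the typed phrase. Using your example: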
<analyzer type="index">
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
  <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="false"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
Upvotes: 1
Reputation: 11
The best way to do multi-word auto-complete in Solr 1.4 is with EdgeNGramFilterFactory, because you need to match the user's input as they type it. That is, "i", "in", "ind", and so on should all match and suggest "India".
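A minimal Python sketch of what edge n-grams give you, assuming lowercased terms; `edge_ngrams` and `suggest` are hypothetical helpers, and `min_gram`/`max_gram` only loosely mirror the filter's minGramSize/maxGramSize settings.

```python
def edge_ngrams(term, min_gram=1, max_gram=25):
    """Return the leading prefixes of term, from min_gram up to max_gram characters."""
    term = term.lower()
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


def suggest(typed, indexed_terms):
    """Return indexed terms that have the typed text among their edge n-grams."""
    typed = typed.lower()
    return [t for t in indexed_terms if typed in edge_ngrams(t)]


print(edge_ngrams("India", max_gram=3))
print(suggest("ind", ["India", "China", "Indonesia"]))
```

Because the prefixes are generated at index time, the query side only needs an exact match on whatever the user has typed so far.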
Upvotes: 1