Andrey Kryachkov

Reputation: 901

Partial search with sunspot

Given I have a model

class Firm < ActiveRecord::Base
  searchable do
    text :name
  end
end

And Solr's schema.xml contains

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And I have a Firm with name == 'Ойл-М (Oil-M)'

When I try to search

Sunspot.search(Firm) do
  fulltext 'Ойл-М'
end

Then I get nothing

When I try to search

Sunspot.search(Firm) do
  fulltext 'Ойл'
end

Then I get the Firm I need

How should I set up Solr and/or search to be able to find this Firm by both queries?

Upvotes: 0

Views: 285

Answers (1)

Patricia

Reputation: 934

Your NGramFilter drops the final 'М' because you have minGramSize=2: a one-character token is too short to yield any gram. Setting minGramSize=1 will work, but it greatly increases the amount of data Solr has to store and also drives up noise in the results.

When you index and query a field in Solr, two things happen:

  1. The field is split up into smaller pieces (tokenized),
  2. Each token is then filtered.

This happens separately for indexing and querying.

In this case, you are indexing the field with StandardTokenizerFactory, StandardFilter, LowerCaseFilter, and an NGramFilter, and querying the field with everything except the NGramFilter.

Here's what's happening when you index "Ойл-М (Oil-M)" into Solr.

StandardTokenizerFactory: ['Ойл', 'М', 'Oil', 'M']
StandardFilter: ['Ойл', 'М', 'Oil', 'M']
LowerCaseFilter: ['ойл', 'м', 'oil', 'm']
NGramFilter: ['ой', 'йл', 'ойл', 'oi', 'il', 'oil']

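The 'м' (and the Latin 'm') drops away completely, since a single character cannot produce a gram of length 2 or more. At query time there is no NGramFilter, so "Ойл-М" becomes ['ойл', 'м']. Searching for "Ойл-М" returns nothing, because there is no 'м' in the index to match.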
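To make the mismatch concrete, here is a rough Ruby simulation of the two analyzer chains. It is only a sketch: the helper names (tokenize, ngrams, index_terms, query_terms) are made up for illustration, the regex merely approximates StandardTokenizer for this particular input, and none of this is Solr's actual implementation.

```ruby
# Rough simulation of the index-time and query-time analyzers (not Solr code).

# Approximates StandardTokenizer for this input:
# keep runs of letters/digits, drop punctuation and whitespace.
def tokenize(text)
  text.scan(/[\p{L}\p{N}]+/)
end

# Approximates NGramFilter: every substring of the token whose length is
# between min and max. A token shorter than min yields nothing at all.
def ngrams(token, min, max)
  (min..[max, token.length].min).flat_map do |n|
    token.chars.each_cons(n).map(&:join)
  end
end

# Index-time chain: tokenize, lowercase, n-gram.
def index_terms(text, min: 2, max: 30)
  tokenize(text).map(&:downcase).flat_map { |t| ngrams(t, min, max) }.uniq
end

# Query-time chain: tokenize, lowercase (no n-grams).
def query_terms(text)
  tokenize(text).map(&:downcase)
end

indexed = index_terms('Ойл-М (Oil-M)')
missing = query_terms('Ойл-М') - indexed

puts "indexed terms: #{indexed.inspect}"
puts "query terms missing from the index: #{missing.inspect}"
```

With the defaults above (minGramSize=2), the only query term missing from the index is 'м'. Calling index_terms with min: 1 makes the missing list empty, but the number of distinct indexed terms for this one name grows from 6 to 14, which is the storage and noise cost mentioned earlier.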

Cut out the NGramFilter unless you have a very good reason to use it, and stick with the standard Russian fieldType.

<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" enablePositionIncrements="~
    <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
  </analyzer>
</fieldType>

NOTE: There is no distinction here between the index analyzer and the query analyzer. A single <analyzer> element without a type attribute is used for both, so each query is transformed in exactly the same way as the indexed text.

Upvotes: 2
