bmalets

Reputation: 3297

Search part of phone number with Sunspot Solr

I am developing a Rails app with the Sunspot Solr search engine, and I need to index phone numbers in Solr 4.1.

For example, if I have the phone number "+12 (456) 789-0101", my page should be found by queries:

I know that I can use:

So, here is what I have done:

  1. Create a new Solr field type in schema.xml:

    <fieldType name="phone_number" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20" side="front"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20" side="back"/>
      </analyzer>
    </fieldType>

    <dynamicField name="*_phone" stored="false" type="phone_number" multiValued="true" indexed="true"/>

  2. Define searchable phone fields as '*_phone' type:

    string :work_phone, :as => :work_phone, :stored => true do
      work_phone.gsub(/\D/, '') if work_phone
    end

    string :mobile_phone, :as => :mobile_phone, :stored => true do
      mobile_phone.gsub(/\D/, '') if mobile_phone
    end

  3. Run reindexing:

    bundle exec rake sunspot:rebuild

But it does not work. After reindexing finishes, I can find results only with the queries "full phone" and "left part of phone". Searching with "middle part of phone" or "right part of phone" returns no results.

Did I do something wrong? How can I make searching by part of a phone number work correctly? Please help. Thanks!

Upvotes: 1

Views: 1953

Answers (2)

bmalets

Reputation: 3297

Actually, here is my code, which works:

Schema.xml:

    <fieldType class="solr.TextField" name="phone_number" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1"/>
      </analyzer>
    </fieldType>

    <dynamicField name="*_phone"  stored="false" type="phone_number" multiValued="false" indexed="true"/>
    <dynamicField name="*_phones" stored="false" type="phone_number" multiValued="false" indexed="true"/>

And ruby code:

  text :work_phone

  text :work_phone_parts, :as => :work_phone do
    "00#{work_phone.gsub(/\D/, '')}" if work_phone
  end

  text :mobile_phone

  text :mobile_phone_parts, :as => :mobile_phone do
    "00#{mobile_phone.gsub(/\D/, '')}" if mobile_phone
  end
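
To see why the NGram filter makes middle and right parts searchable, here is a rough plain-Ruby imitation of the index side. It is only a sketch: `ngrams` and `normalize_phone` are hypothetical helpers written for illustration, not Solr or Sunspot APIs, and the query side is assumed to arrive as a single digit-only term.

```ruby
# Imitates NGramFilterFactory (minGramSize=3, maxGramSize=20):
# every substring of the token whose length is within the bounds.
def ngrams(token, min_size = 3, max_size = 20)
  grams = []
  (0...token.length).each do |start|
    (min_size..max_size).each do |size|
      break if start + size > token.length
      grams << token[start, size]
    end
  end
  grams.uniq
end

# Same normalization as the *_phone_parts fields: keep digits, prefix "00".
def normalize_phone(raw)
  "00#{raw.gsub(/\D/, '')}"
end

indexed = ngrams(normalize_phone("+12 (456) 789-0101"))

puts indexed.include?("0012")  # left part (with the "00" prefix)
puts indexed.include?("789")   # middle part
puts indexed.include?("0101")  # right part
```

Because every substring of three or more digits is in the index, a query term only has to equal one of them, wherever it sits in the number.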

Upvotes: 2

Alexandre Rafalovitch

Reputation: 9789

(Commenting on the Solr part only; I'm not sure how Sunspot maps it.)

There are a couple of things not quite right here:

  1. side=back is no longer an option since Solr 4.4, so you are probably just getting two copies of the same filter
  2. Having two copies of the same filter is bad anyway, as the second one will look at all the tokens issued by the first and things will get messy.
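
A small plain-Ruby sketch of the second point, assuming (as guessed above) that both filters end up acting as front edge n-gram filters. `edge_ngrams` and `chain_twice` are hypothetical helpers that only imitate the filter chain; they are not Solr code.

```ruby
# Front edge n-grams: prefixes of the token with length in min_size..max_size.
def edge_ngrams(token, min_size = 3, max_size = 20)
  upper = [max_size, token.length].min
  (min_size..upper).map { |size| token[0, size] }
end

# Chained filters: the second one processes every token emitted by the first.
def chain_twice(token)
  edge_ngrams(token).flat_map { |gram| edge_ngrams(gram) }
end

digits = "+12 (456) 789-0101".gsub(/\D/, "")  # "124567890101"
tokens = chain_twice(digits)

puts tokens.length        # many duplicate tokens are emitted
puts tokens.uniq.inspect  # but only prefixes of the number survive
puts tokens.include?("124")   # left part
puts tokens.include?("789")   # middle part
puts tokens.include?("0101")  # right part
```

Under that assumption the index holds nothing but prefixes, which is consistent with only "full phone" and "left part of phone" queries matching.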

Here is a good way to match suffixes. It strips all the non-digit characters and uses asymmetric index and query analyzers (from my AirPair Solr tutorial):

<fieldType name="phone" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
</fieldType>
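
The effect of this field type can be imitated in plain Ruby. This is only an illustrative sketch: the helper names are made up, and each method stands in for one step of the analyzer chain above rather than calling Solr.

```ruby
# PatternReplaceFilterFactory step: drop everything that is not a digit.
def strip_non_digits(value)
  value.gsub(/[^0-9]/, "")
end

# EdgeNGramFilterFactory step: prefixes with length in min_size..max_size.
def edge_ngrams(token, min_size = 3, max_size = 30)
  upper = [max_size, token.length].min
  (min_size..upper).map { |size| token[0, size] }
end

# Index analyzer: strip, reverse, then take edge n-grams of the reversed digits.
def index_tokens(raw)
  edge_ngrams(strip_non_digits(raw).reverse)
end

# Query analyzer: strip and reverse only (no n-grams on the query side).
def query_token(raw)
  strip_non_digits(raw).reverse
end

def suffix_match?(stored, query)
  index_tokens(stored).include?(query_token(query))
end

stored = "+12 (456) 789-0101"
puts suffix_match?(stored, "789-0101")   # suffix of the number
puts suffix_match?(stored, "0101")       # shorter suffix
puts suffix_match?(stored, "456")        # middle part
puts suffix_match?(stored, "+12 (456)")  # prefix
```

Reversing the digits turns a suffix of the phone number into a prefix of the indexed token, so the edge n-grams cover exactly the suffixes; middle parts and prefixes do not match with this chain.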

Note that this will not help with queries that contain spaces when you use the default query parser, as they will be split on whitespace before they reach field analysis. If you know you are searching for a phone number, you can either quote the search string or switch to a different (probably field) query parser.
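
A very rough Ruby illustration of the whitespace issue. `split_clauses` is a simplified stand-in for a whitespace-splitting query parser, not Solr's actual parser, and `quote_query` is a hypothetical helper you could apply to user input before building the query.

```ruby
# Simplified stand-in for a whitespace-splitting parser:
# quoted phrases stay together, everything else is split on spaces.
def split_clauses(query)
  query.scan(/"[^"]*"|\S+/).map { |clause| clause.delete('"') }
end

# Hypothetical helper: wrap the input in quotes so it arrives as one clause.
def quote_query(raw)
  %("#{raw.delete('"')}")
end

user_input = "789 0101"
p split_clauses(user_input)               # two separate clauses
p split_clauses(quote_query(user_input))  # one clause for field analysis
```

In the unquoted case each piece would be analyzed on its own, so the phone field never sees "789 0101" as a whole.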

If you do want to match the middle, maybe you don't want any of that and just want NGram, not EdgeNGram analysis.

Upvotes: 2
