Reputation: 3297
I am developing a Rails app that uses the Sunspot Solr search engine, and I need to index phone numbers in Solr 4.1.
For example, if I have the phone number "+12 (456) 789-0101", my page should be found by these queries:
.......(456) 789......... (middle part of phone in correct format)
124567890101 (full phone with numbers only)
I know that I can use:
EdgeNGramFilterFactory for splitting the phone number into n-grams (front and back)
WordDelimiterFilterFactory for catenating numbers and splitting the phone number into parts
So, here is what I have done:
Create a new Solr field type in schema.xml:
<fieldType name="phone_number" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20" side="back"/>
  </analyzer>
</fieldType>
<dynamicField name="*_phone" stored="false" type="phone_number" multiValued="true" indexed="true"/>
Define the searchable phone fields so that they map to the '*_phone' dynamic field:
string :work_phone, :as => :work_phone, :stored => true do
  work_phone.gsub(/\D/, '') if work_phone
end
string :mobile_phone, :as => :mobile_phone, :stored => true do
  mobile_phone.gsub(/\D/, '') if mobile_phone
end
Run reindexing:
bundle exec rake sunspot:rebuild
But it does not work: once reindexing has finished, I only get results for the "full phone" and "left part of phone" queries. Searching by the "middle part of phone" or the "right part of phone" does not return any results.
Did I do something wrong? How can I make searching by part of a phone number work correctly? Please help. Thanks!
Upvotes: 1
Views: 1953
Reputation: 3297
Actually, here is my code, which works:
schema.xml:
<fieldType class="solr.TextField" name="phone_number" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1"/>
  </analyzer>
</fieldType>
<dynamicField name="*_phone" stored="false" type="phone_number" multiValued="false" indexed="true"/>
<dynamicField name="*_phones" stored="false" type="phone_number" multiValued="false" indexed="true"/>
And the Ruby code:
text :work_phone
text :work_phone_parts, :as => :work_phone do
  "00#{work_phone.gsub(/\D/, '')}" if work_phone
end
text :mobile_phone
text :mobile_phone_parts, :as => :mobile_phone do
  "00#{mobile_phone.gsub(/\D/, '')}" if mobile_phone
end
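To see what value the `_parts` blocks above hand to Solr, here is a small standalone Ruby sketch. The helper name `phone_parts` is hypothetical (it is not part of Sunspot); it only mirrors the block bodies: strip every non-digit, prepend "00", and return nil when no phone is set.

```ruby
# Hypothetical helper mirroring the block bodies above (not a Sunspot API):
# remove every non-digit character and prepend "00".
# Returns nil when the phone is nil, just like the trailing `if` in the blocks.
def phone_parts(phone)
  "00#{phone.gsub(/\D/, '')}" if phone
end

puts phone_parts("+12 (456) 789-0101")  # prints 00124567890101
puts phone_parts("124567890101")        # prints 00124567890101
p phone_parts(nil)                      # prints nil
```

Differently formatted inputs collapse to the same digit string, so the n-grams built at index time depend only on the digits.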
Upvotes: 2
Reputation: 9789
(Commenting on the Solr part only; I'm not sure how Sunspot maps it.)
There are a couple of things not quite right here:
Here is a good way to match suffixes. It strips all the random non-digit characters and uses asymmetric index and query analyzers (from my AirPair Solr tutorial):
<fieldType name="phone" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])" replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
</fieldType>
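The index/query asymmetry in the field type above can be sketched in plain Ruby. This is only a rough simulation, not Solr itself: the helper names (`digits_only`, `edge_ngrams`, `index_tokens`, `query_token`) are made up for illustration, and each one stands in for one step of the analyzer chain under the assumption that the whole input arrives as a single keyword token.

```ruby
# Stand-in for PatternReplaceFilterFactory: drop every non-digit character.
def digits_only(text)
  text.gsub(/[^0-9]/, "")
end

# Stand-in for a front EdgeNGramFilterFactory: prefixes of length min..max.
def edge_ngrams(token, min = 3, max = 30)
  upper = [max, token.length].min
  (min..upper).map { |size| token[0, size] }
end

# Index-time chain: digits only -> reverse -> front edge n-grams.
# Prefixes of the reversed string are suffixes of the original number.
def index_tokens(phone)
  edge_ngrams(digits_only(phone).reverse)
end

# Query-time chain: digits only -> reverse, with no n-gramming.
def query_token(query)
  digits_only(query).reverse
end

indexed = index_tokens("+12 (456) 789-0101")
puts indexed.include?(query_token("789-0101"))   # prints true  (suffix of the number)
puts indexed.include?(query_token("(456) 789"))  # prints false (middle of the number)
```

Because only the index side is n-grammed, a query is matched as one exact token, which keeps suffix lookups precise but leaves middle fragments unmatched.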
Note that this will not help with queries that contain spaces when the default query parser is used, because they are split on whitespace before they reach field analysis. If you know you are searching for a phone number, you can either quote the search string or switch to a different query parser (probably the field query parser).
If you do want to match the middle, maybe you don't want any of that and just want NGram analysis instead of EdgeNGram.
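The difference between the two is easy to see with a small Ruby sketch. The functions `edge_ngrams` and `ngrams` below are simplified stand-ins written for illustration, not Solr code; they assume a single digits-only token and the 3 to 20 gram sizes used earlier in this thread.

```ruby
# Simplified front edge n-grams: only prefixes of length min..max.
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |size| token[0, size] }
end

# Simplified full n-grams: every substring of length min..max.
def ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).flat_map do |size|
    (0..token.length - size).map { |start| token[start, size] }
  end
end

digits = "124567890101"
middle = "456789"

puts edge_ngrams(digits, 3, 20).include?(middle)  # prints false
puts ngrams(digits, 3, 20).include?(middle)       # prints true
puts edge_ngrams(digits, 3, 20).length            # prints 10
puts ngrams(digits, 3, 20).length                 # prints 55
```

Full n-grams cover middle fragments at the cost of more tokens per value (55 versus 10 for this 12-digit number), so the index grows accordingly.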
Upvotes: 2