Chris

Reputation: 1118

Solr search dash in part number

I'm having some difficulty working out either how to construct the Solr query or how to set up the schema so that searches in our web store work better.

First some configuration (Solr 4.2.1)

<field name="mfgpartno" type="text_en_splitting_tight" indexed="true" stored="true" />
<field name="mfgpartno_sort" type="string" indexed="true" stored="false" />
<field name="mfgpartno_search" type="sku_partial" indexed="true" stored="true" />

<copyField source="mfgpartno" dest="mfgpartno_sort" />
<copyField source="mfgpartno" dest="mfgpartno_search" />

<fieldType name="sku_partial" class="solr.TextField" omitTermFreqAndPositions="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="100" side="front" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    </analyzer>
</fieldType>

Let me break this down into stages (I'm only going into enough detail to replicate the problem; the initial stages don't use edismax, which is what we've chosen to use on our website):

  1. q=DV\-5PBRP <- With this query I get 18 results, but not the one I'm looking for (this is most likely due to the default df searching on the productname field - fine)
  2. q=mfgpartno_search:DV\-5PBRP <- this gives me the 1 result I'm looking for, but due to the query building I need to do on the website it's better if I can use the q parameter like stage 1.
  3. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search <- this also gives me the 1 result I'm looking for, but again, for the website search, qf needs to span more fields (actual qf = productname_search shortdesc_search fulldesc_search mfgpartno_search productname shortdesc fulldesc keywords). To get more accurate results across those fields, I implemented stage 4.
  4. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND <- with this test I get 0 results - though this works great for most searches on our site.
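
For reference, the backslash-escaped dash used in the queries above can be produced programmatically rather than by hand. This is a minimal Python sketch, not part of the site's actual code: `escape_lucene` is a hypothetical helper, and the character set is an assumption based on the characters the classic Lucene query parser treats as syntax.

```python
# Characters assumed to be query syntax in the classic Lucene query parser.
# This set is an assumption for illustration; check it against your Solr version.
_LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(term):
    """Prefix every special character in `term` with a backslash."""
    return ''.join('\\' + ch if ch in _LUCENE_SPECIAL else ch for ch in term)


if __name__ == "__main__":
    # Prints the escaped form of the part number from the question.
    print(escape_lucene("DV-5PBRP"))
```

Escaping only stops the query parser from treating the dash as an operator; the field's query analyzer still decides how the dash is tokenized afterwards.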

My big problem with search has been special characters like the dash, which sometimes must be treated literally and sometimes act as separators, as in product names or descriptions. Sometimes people will even replace the dash with a space in a part number search, and it should still show relevant results.

I'm kind of stuck on how to get this special character search working - especially as it pertains to this mfgpartno_search field. How might I configure either the schema or query (or both) to get this working?

Upvotes: 6

Views: 1528

Answers (3)

Chris

Reputation: 1118

Ok, I think the problem was being over-thought.

I had assumed (based on my config) that the example part number might be indexed like so:

DV-5PBRP -> {DV 5PBRP, DV5PBRP, DV-5PBRP} + NGrams

I had also assumed doing a search on "DV-5PBRP" (literal dash) would match that third option (using a query like #4 in my question).

Yesterday I was alerted to this problem by the same user again, and I got to thinking let's try removing the separator in the search. So now the search has become:

q=DV5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND

I got the result I was looking for, which means that my Solr config is at least giving me an index entry like the second option above.

Now I've started stripping separator characters from user input before submitting the search to Solr. This seems to work beautifully!
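
A minimal sketch of that input clean-up, written in Python for illustration. The function name `strip_separators` and the exact set of separator characters are assumptions, not the code actually running on the site.

```python
import re

# Hypothetical separator set: whitespace, dash, underscore, slash and dot.
# Adjust it to whatever actually appears in your part numbers.
_SEPARATORS = re.compile(r'[\s\-_/.]+')


def strip_separators(user_input):
    """Remove separator characters so 'DV-5PBRP' and 'DV 5PBRP' both become 'DV5PBRP'."""
    return _SEPARATORS.sub('', user_input)


if __name__ == "__main__":
    for raw in ("DV-5PBRP", "DV 5PBRP", " dv-5pbrp "):
        print(repr(raw), "->", strip_separators(raw))
```

Because this collapses every separator, it probably suits a part-number lookup better than a product name or description search, where the dash or space may carry meaning.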

Upvotes: 0

JESTIN6699

Reputation: 49

If you are using the HTTP GET method, encode the search term before sending it, using

URLEncoder.encode(searchWord,"UTF-8")

That is for Java. If you are not using Java, use the equivalent encoding function in your language. This helps avoid problems with characters such as spaces and "/".
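
As one example outside Java, here is a rough Python equivalent using the standard library's `urllib.parse.urlencode`. The function name `build_query_string` is hypothetical, and the parameters are simply the ones from stage 4 of the question.

```python
from urllib.parse import urlencode


def build_query_string(search_word):
    """Return a URL-encoded query string for a Solr select request."""
    params = {
        "q": search_word,
        "defType": "edismax",
        "qf": "mfgpartno_search",
        "q.op": "AND",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return urlencode(params)


if __name__ == "__main__":
    print(build_query_string("DV 5PBRP/A"))
```

Note that URL encoding only protects the request in transit; it does not change how Solr parses or analyzes the term once it arrives.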

Upvotes: 0

claj

Reputation: 5402

Maybe you could try the Regular Expression Pattern Tokenizer and write a suitable regular expression for your article numbers. Lucene (which Solr is built upon) is very focused on tokenization for prose.

What you want here is probably an N-gram split that includes 1-grams, and maybe with dashes replaced by spaces, something like

DV-5PBRP -> {DV 5PBRP, DV, 5P, BR, PB, RP, D, V, 5, P, B, R}

As you can see, the index will be quite large for very small fields. Make sure the ranking of the results is heavily weighted toward the larger n-grams.

I do think you should remove the stop word list for the article numbers field.

The N-gram size should probably start at 1 or 2.
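
To get a feel for how many terms such a split produces, here is a small plain-Python illustration. It is not Solr's n-gram filter, only a sketch of the idea; `char_ngrams` and its defaults (sizes 1 to 2, dash replaced by a space, lowercasing) are assumptions.

```python
def char_ngrams(text, min_size=1, max_size=2):
    """Return the distinct character n-grams of each dash/space separated piece."""
    grams = []
    for piece in text.replace("-", " ").lower().split():
        for size in range(min_size, max_size + 1):
            for start in range(len(piece) - size + 1):
                gram = piece[start:start + size]
                if gram not in grams:  # keep first occurrence only
                    grams.append(gram)
    return grams


if __name__ == "__main__":
    terms = char_ngrams("DV-5PBRP")
    print(len(terms), terms)
```

Even with a maximum size of 2, an eight-character part number yields eleven distinct terms here, which is why the index grows quickly as the maximum size goes up.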

Simply make sure the various analyzers don't:

  • swallow the dash
  • remove single characters or very short tokens (these are often in stop word lists)
  • remove numbers

Upvotes: 1
