d.garanzha

Reputation: 813

Solr EdgeNGramFilter returns the wrong results

Query: mpn:"MEM-CF-512MB-AOK"

Solr response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "fl": "id, mpn, name",
      "indent": "true",
      "q": "mpn:\"MEM-CF-512MB-AOK\"",
      "_": "1375801439480",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "1340120",
        "mpn": "MEM-CF-256MB-AOK",
        "name": "256MB CompactFlash"
      },
      {
        "id": "1340129",
        "mpn": "MEM-CF-512MB-AOK",
        "name": "512MB CompactFlash"
      }
    ]
  },
  "spellcheck": {
    "suggestions": [
      "correctlySpelled",
      true
    ]
  }
}

Expected:

{
  "id": "1340129",
  "mpn": "MEM-CF-512MB-AOK",
  "name": "512MB CompactFlash"
}

I need each of the following searches to find that document:

1) MEM-CF-512MB-AOK

2) MEM-CF-512MB

3) MEM-CF-512MB-AO

4) M-CF-512MB-AOK

5) -CF-512MB-AOK

schema.xml:

<field name="mpn" type="text_general_edge_ngram" indexed="true" stored="true"/>

<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

Upvotes: 1

Views: 330

Answers (2)

femtoRgon

Reputation: 33341

LowerCaseTokenizer is functionally equivalent to a LetterTokenizer followed by a LowerCaseFilter. Judging by the case you've provided, you don't want LetterTokenizer-like behavior, which splits on every non-letter character and indexes only runs of consecutive letters, so the digits and hyphens are discarded. Effectively, before the n-gramming, both of your documents produce the same tokens:

mem, cf, mb, aok
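
To make that concrete, here is a rough Python imitation of what the tokenizer does to the two part numbers. It is only a sketch, not Solr's code: `letter_lowercase_tokens` is a made-up helper, and it assumes plain ASCII input.

```python
import re

def letter_lowercase_tokens(text):
    # Assumption: ASCII input only. Keeps runs of letters and lowercases them,
    # dropping digits and punctuation, roughly like LowerCaseTokenizer.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

query = letter_lowercase_tokens("MEM-CF-512MB-AOK")
doc_256 = letter_lowercase_tokens("MEM-CF-256MB-AOK")
doc_512 = letter_lowercase_tokens("MEM-CF-512MB-AOK")

print(query)             # ['mem', 'cf', 'mb', 'aok']
print(query == doc_256)  # True: the digits that differ never reach the index
print(query == doc_512)  # True
```

Since "256" and "512" are thrown away before anything is indexed, the phrase query has no way to tell the two documents apart.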

I think what you want is a KeywordTokenizer and a LowerCaseFilter, which keep the whole part number as a single lowercased token.

Since you want to be able to search with characters missing at the end as well as at the beginning, you need edge n-grams combined with a prefix query. An EdgeNGramFilter with side="back" produces grams that take characters off the front (side="front", as in your schema, yields prefixes of the token instead), such as:

mem-cf-512mb-aok, em-cf-512mb-aok, m-cf-512mb-aok, -cf-512mb-aok

So, to pick up matches with missing characters at the end, a simple prefix search should work, like:

m-cf-512mb-a*

minGramSize="1" is almost certainly overzealous. You probably don't want 1-grams (i.e. matching on just "k"). The shortest search in your list is 12 characters long, for instance. I'll guess 5 as a reasonable minimum gram size.

<analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- side="back" keeps the end of the token and drops leading characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="5" maxGramSize="50" side="back"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

And again, you should use queries appended with a trailing wildcard.
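
As a rough sanity check of this approach against the five searches in the question, here is a small Python simulation. It is a sketch under assumptions, not Solr itself: it assumes the grams are anchored at the end of the token (dropping leading characters, as in the list of grams above), and `back_edge_ngrams` and `prefix_matches` are hypothetical helpers standing in for the index-time filter and a trailing-wildcard query.

```python
def back_edge_ngrams(token, min_size=5, max_size=50):
    # Grams anchored at the end of the token: each one drops leading characters.
    longest = min(len(token), max_size)
    return [token[-n:] for n in range(min_size, longest + 1)]

def prefix_matches(query, indexed_terms):
    # Mimics a trailing-wildcard query such as mpn:m-cf-512mb-a*
    q = query.strip().lower()
    return any(term.startswith(q) for term in indexed_terms)

# Keyword tokenization + lowercasing: the whole part number is one token.
index = {
    "1340120": back_edge_ngrams("MEM-CF-256MB-AOK".lower()),
    "1340129": back_edge_ngrams("MEM-CF-512MB-AOK".lower()),
}

searches = ["MEM-CF-512MB-AOK", "MEM-CF-512MB", "MEM-CF-512MB-AO",
            "M-CF-512MB-AOK", "-CF-512MB-AOK"]

for s in searches:
    hits = [doc_id for doc_id, terms in index.items() if prefix_matches(s, terms)]
    print(s, "->", hits)  # each search lists only 1340129
```

In this simulation every search finds only the 512MB document, because the grams cover characters missing at the front and the prefix match covers characters missing at the end.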

Upvotes: 2

Srikanth Venugopalan

Reputation: 9049

The scenario you've described looks like an exact match on the mpn field.

However, you've defined mpn as an edge n-gram field with minGramSize="1", so indexing starts from 1-grams onwards, which I imagine isn't what you need.

To get this sorted, I guess you could add a second field without n-grams (keeping the n-gram field if you want it for another reason) and run your exact-match query against that field. For example:

mpn_exact:"MEM-CF-512MB-AOK"

You can test this out on the Analysis page of the Solr Admin console.

Upvotes: 0
