Solr EdgeNGramFilter returns wrong response

Question

Query: mpn:"MEM-CF-512MB-AOK"

Solr response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "fl": "id, mpn, name",
      "indent": "true",
      "q": "mpn:\"MEM-CF-512MB-AOK\"",
      "_": "1375801439480",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "1340120",
        "mpn": "MEM-CF-256MB-AOK",
        "name": "256MB CompactFlash"
      },
      {
        "id": "1340129",
        "mpn": "MEM-CF-512MB-AOK",
        "name": "512MB CompactFlash"
      }
    ]
  },
  "spellcheck": {
    "suggestions": [
      "correctlySpelled",
      true
    ]
  }
}

Expected:

{
  "id": "1340129",
  "mpn": "MEM-CF-512MB-AOK",
  "name": "512MB CompactFlash"
}

I need all of these searches to find it:

1) MEM-CF-512MB-AOK

2) MEM-CF-512MB

3) MEM-CF-512MB-AO

4) M-CF-512MB-AOK

5) -CF-512MB-AOK

schema.xml:

femtoRgon · Accepted Answer

LowerCaseTokenizer is functionally equivalent to a LetterTokenizer followed by a LowerCaseFilter. Judging by the case you've provided, you don't want LetterTokenizer-like behavior, which only indexes consecutive runs of letters and throws away digits and hyphens. Effectively, before the n-gramming, you have the tokens:

mem, cf, mb, aok

I think what you want is a KeywordTokenizer and a LowerCaseFilter, which keep the whole value as a single lowercased token.
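To see why both documents come back, here is a rough Python approximation of the two analysis chains. It is only a sketch: the function names are made up for illustration, and the regex handles ASCII letters only, whereas the real Lucene tokenizer works on Unicode letters.

```python
import re


def letter_lowercase_tokens(text: str) -> list[str]:
    """Approximate LowerCaseTokenizer: lowercase, keep only runs of letters."""
    # ASCII-only simplification; digits and hyphens act as separators.
    return re.findall(r"[a-z]+", text.lower())


def keyword_lowercase_tokens(text: str) -> list[str]:
    """Approximate KeywordTokenizer + LowerCaseFilter: one lowercased token."""
    return [text.lower()]


for mpn in ("MEM-CF-256MB-AOK", "MEM-CF-512MB-AOK"):
    print(mpn, letter_lowercase_tokens(mpn), keyword_lowercase_tokens(mpn))
```

Both part numbers collapse to the same letter-only tokens, so the index cannot tell 256MB from 512MB. With the keyword-style chain the two values stay distinct.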

Since you want to match when characters are missing at the end as well as at the beginning, you also need to perform a prefix query. An EdgeNGramTokenizer only produces n-grams by taking characters off the front, such as:

mem-cf-512mb-aok, em-cf-512mb-aok, m-cf-512mb-aok, -cf-512mb-aok
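The grams listed above can be reproduced with a small sketch. The helper name `back_edge_ngrams` and the `max_gram=25` value are assumptions for illustration, not taken from your schema, and this only mimics the general idea of edge n-grams anchored at the end of the token.

```python
def back_edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Suffixes of token with lengths from min_gram to max_gram, longest first."""
    longest = min(max_gram, len(token))
    # Each gram drops one more character from the front of the token.
    return [token[len(token) - n:] for n in range(longest, min_gram - 1, -1)]


# max_gram=25 is an assumed value, large enough to cover the full part number.
grams = back_edge_ngrams("mem-cf-512mb-aok", min_gram=5, max_gram=25)
print(grams[:4])
print(len(grams), grams[-1])
```

Every gram ends in "aok", which is why an exact term match can absorb missing leading characters but not missing trailing ones.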

So, to pick up matches with missing characters at the end, a simple prefix search should work, like:

m-cf-512mb-a*

minGramSize="1" is almost certainly overzealous. You probably don't want 1-grams (i.e., matching on just "k"). The shortest case you listed above is 12 characters long, for instance. I'd guess 5 is a reasonable minimum gram size.

And again, you should append a trailing wildcard to your queries.
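Putting the pieces together, here is a toy in-memory model of the idea (suffix grams at index time, prefix match at query time). It is not how Solr stores or searches terms, the helper names are hypothetical, and the minimum gram size of 5 is just the guess from above. The sketch lowercases the user input itself before matching.

```python
def back_edge_ngrams(token: str, min_gram: int) -> set[str]:
    """All suffixes of token that are at least min_gram characters long."""
    return {token[i:] for i in range(len(token) - min_gram + 1)}


def build_index(docs: dict[str, str], min_gram: int = 5) -> dict[str, set[str]]:
    """Map each indexed gram to the ids of the documents that produced it."""
    index: dict[str, set[str]] = {}
    for doc_id, mpn in docs.items():
        for gram in back_edge_ngrams(mpn.lower(), min_gram):
            index.setdefault(gram, set()).add(doc_id)
    return index


def prefix_search(index: dict[str, set[str]], user_input: str) -> set[str]:
    """Emulate a trailing-wildcard query: match terms starting with the input."""
    prefix = user_input.lower()
    hits: set[str] = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return hits


DOCS = {"1340120": "MEM-CF-256MB-AOK", "1340129": "MEM-CF-512MB-AOK"}
QUERIES = [
    "MEM-CF-512MB-AOK",
    "MEM-CF-512MB",
    "MEM-CF-512MB-AO",
    "M-CF-512MB-AOK",
    "-CF-512MB-AOK",
]

index = build_index(DOCS)
for query in QUERIES:
    print(f"{query}* ->", sorted(prefix_search(index, query)))
```

In this model all five searches from the question hit only document 1340129, while a shorter input such as MEM-CF still matches both part numbers, as you would expect from a prefix search.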
