Karthikeyan

Reputation: 2001

ElasticSearch - Match Query with fuzziness searching alphanumeric

I am using a match query with fuzziness to search for an alphanumeric term, and the results are not what I expect.

Below is the query I am running in Kibana:

GET index_name/_search
{
    "query": {
        "match" : {
            "values" : {
                "query" : "A661752110",
                "operator" : "and",
                "fuzziness": 1,
                "boost": 1.0,
                "prefix_length": 0,
                "max_expansions": 100
            }
        }
    }
}

I am expecting results like these:

A661752110
A66175211012
A661752110111
A661752110-12
A661752110-111

But I am getting results like these:

A661752110
A661752111
A661752119

Here are my mapping details:

PUT index_name
{
    "settings": {
        "analysis": {
            "analyzer": {
                "attr_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "html_strip"
                    ],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },

    "mappings": {
        "doc": {
            "properties": {
                "values": {
                    "type": "text",
                    "analyzer": "attr_analyzer"
                },
                "id":{
                    "type": "text"
                }
            }
        }
    }
}

Upvotes: 0

Views: 989

Answers (1)

Benoit Guigal

Reputation: 858

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. Elasticsearch uses the Damerau-Levenshtein distance to measure the similarity of two strings. This distance counts the single-character edits needed to turn one string into the other, allowing four kinds of edit:

  • Substitution of one character for another: _f_ox → _b_ox
  • Insertion of a new character: sic → sic_k_
  • Deletion of a character: b_l_ack → back
  • Transposition of two adjacent characters: _st_ar → _ts_ar

The maximum edit distance is controlled in the search request with the fuzziness parameter. You specified a fuzziness of 1, which means Elasticsearch will only return terms that are at most one edit (substitution, insertion, deletion or transposition) away from "A661752110".

The values you were expecting that did not show up all have an edit distance strictly greater than 1. Note that the maximum fuzziness Elasticsearch allows is 2.
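To make the distances concrete, here is a minimal Python sketch that computes them for the values in the question. It uses the optimal string alignment variant of Damerau-Levenshtein, which I am assuming is close enough to what Lucene does for these inputs. The function name `osa_distance` is mine, and the terms are lowercased by hand to mimic the `lowercase` filter in your analyzer.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


query = "a661752110"
candidates = [
    "a661752110", "a661752111", "a661752119",      # returned by the query
    "a66175211012", "a661752110111",               # expected, not returned
    "a661752110-12", "a661752110-111",
]
distances = {term: osa_distance(query, term) for term in candidates}
for term, dist in distances.items():
    print(f"{term}: {dist}")
```

The three values you got back are at distance 0 or 1. The ones you expected are at distance 2, 3, 3 and 4, so only A66175211012 could be reached even with the maximum fuzziness of 2.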

Some suggestions to achieve what you want:

  • If you want A661752110-12 and A661752110-111 to match, you can use a tokenizer that splits text on -. The standard tokenizer does this, for example.
  • If you also want A66175211012 and A661752110111 to match, the best choice is a regexp query like the one below. Regexp queries are not analyzed, so the pattern must be lowercase to match the terms your analyzer indexes.

{ "query": { "regexp": { "values": { "value": "a661752110.{0,3}" } } } }
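As a rough check of which values a pattern of this shape would cover, here is a small Python sketch. It uses Python's `re` module as a stand-in for Lucene's regexp engine (an assumption: the two syntaxes differ in general, but agree for this simple pattern), with `fullmatch` because a regexp query has to match the whole indexed term. The terms are lowercased by hand and assume the whitespace tokenizer from your mapping, so each value is a single term.

```python
import re

# Lowercase pattern: 0 to 3 arbitrary characters after the base value
pattern = re.compile(r"a661752110.{0,3}")

# Terms as the whitespace tokenizer + lowercase filter would index them
terms = [
    "a661752110",
    "a661752111",
    "a661752119",
    "a66175211012",
    "a661752110111",
    "a661752110-12",
    "a661752110-111",
]

# fullmatch mimics the whole-term matching of a regexp query
matched = [t for t in terms if pattern.fullmatch(t)]
print(matched)
```

With these assumptions, A661752110-111 is not matched because "-111" is four characters. It would need either a larger upper bound in the pattern or the tokenizer change from the first suggestion.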

Upvotes: 1
