Obsidian Phoenix

Reputation: 4155

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.

For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.

In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.

For instance, this query returns a user record as expected:

{
    "from": 0,
    "size": 1,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "firstName": {
                            "query": "sVxGBCkPYZ",
                            "boost": 30
                        }
                    }
                }
            ],
            "should": [

            ]
        }
    },
    "fields": [
        "id",
        "firstName"
    ]
}

However, replacing the match element with the following returns no records:

{
    "fuzzy": {
        "firstName": {
            "value": "sVxGBCkPYZ",
            "fuzziness": 2,
            "boost": 30,
            "min_similarity": 0.3
        }
    }
}

Why would this be happening, and is there anything I can do to remedy the situation?

For reference, this is the ES version I'm currently using:

"version": {
    "number": "1.7.1",
    "build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp": "2015-07-29T09:54:16Z",
    "build_snapshot": false,
    "lucene_version": "4.10.4"
}

Upvotes: 0

Views: 1126

Answers (1)

Kamal Kunjapur

Reputation: 8860

The fuzzy query fails because fuzzy is a term-level query, meaning the query string is not analysed. The data that got indexed, on the other hand, was analysed: assuming the field is of type text with the standard analyzer, the value was lowercased and stored as svxgbckpyz in the inverted index.
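To make the mismatch concrete, here is a minimal Python sketch. It is a toy model and not how Elasticsearch is implemented: analyze is a hypothetical stand-in for the standard analyzer (it only splits on whitespace and lowercases), and the index is a plain dictionary.

```python
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for the standard analyzer: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Build a toy inverted index mapping each analysed term to document ids."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            index[term].add(doc_id)
    return index


def match_query(index, text):
    """Analysed lookup: the query text goes through the same analyzer as the data."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


def term_level_query(index, term):
    """Term-level lookup: the query term is used exactly as typed."""
    return set(index.get(term, set()))


index = build_index({1: "sVxGBCkPYZ"})
print(match_query(index, "sVxGBCkPYZ"))       # analysed query finds document 1
print(term_level_query(index, "sVxGBCkPYZ"))  # mixed-case term is not in the index
print(term_level_query(index, "svxgbckpyz"))  # lowercased term is in the index
```

In this model the mixed-case term simply does not exist in the index, which is the situation the fuzzy query starts from.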

Instead, you can apply fuzziness in a match query, as below:

POST testindex/_search
{  
   "query":{  
      "match":{  
         "firstname":{  
            "query":"sVxGBCkPYZ",
            "fuzziness":"AUTO"
         }
      }
   }
}

You can change the value from AUTO to 1 or 2 depending on your use case; the documented values for fuzziness are 0, 1, 2, and AUTO.

The exact match you mentioned works because its query string is analysed: the input is converted to lower case (svxgbckpyz), which is the term available in the inverted index.

As for how the fuzzy query you mentioned works behind the scenes, this LINK describes it as follows:

The fuzzy query works by taking the original term and building a Levenshtein automaton—like a big graph representing all the strings that are within the specified edit distance of the original string.

The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.

Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly.

Note this phrase in particular: "representing all the strings that are within the specified edit distance of the original string".

For example, some of the strings at a distance of 1 from life are aife, bife, cife, dife, ..., lifz.
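The list for life can be reproduced with a short Python sketch. It only generates single-character substitutions over the lowercase ASCII alphabet (an assumption made for brevity); insertions and deletions also count as distance 1 but are left out here. The helper name substitutions is hypothetical.

```python
import string


def substitutions(word, alphabet=string.ascii_lowercase):
    """Return every string obtained from word by replacing exactly one character."""
    results = []
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                results.append(word[:i] + letter + word[i + 1:])
    return results


neighbours = substitutions("life")
print(len(neighbours))   # 4 positions, 25 alternative letters each
print(neighbours[:4])    # first few substitutions at position 0
print(neighbours[-1])    # last substitution at position 3
```

Even for a four-letter word and substitutions alone this gives 100 candidates, which hints at why larger edit distances become expensive.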

So in your case, the fuzzy query's automaton cannot reach the term svxgbckpyz from the input string sVxGBCkPYZ, because the distance between them is 7 (remember that the distance between A and a is 1). I don't think the AUTO option allows a distance that large, and even if you could configure it to 7, it may still not produce the string, as there would be a huge list of strings within distance 7.
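The distance of 7 can be checked with a plain Levenshtein implementation. This is a minimal sketch of the textbook dynamic-programming algorithm, not the automaton Lucene uses; the function name levenshtein is my own.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


typed = "sVxGBCkPYZ"
indexed = typed.lower()  # assumed form of the term in the inverted index

print(levenshtein(typed, indexed))          # 7: one edit per uppercase letter
print(levenshtein(typed.lower(), indexed))  # 0 once the input is lowercased
```

Assuming the field really is lowercased at index time, lowercasing the value on the client before sending it in a fuzzy query would bring the distance down to 0, which is effectively what the match query does for you.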

Adding one more LINK for more info. Hope it helps!

Upvotes: 1
