Reputation: 1276
I need to query exactly against a set of "short documents". Example:
Documents:
Expected results:
Is it possible with ES? How can I achieve this? I tried boosting "name", but I can't find how to match the document field exactly, instead of searching inside of it.
Upvotes: 2
Views: 2099
Reputation: 17319
What you are describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document, based on factors like how often the term appears in that document (term frequency), how rare the term is across all documents (inverse document frequency), and how short the field is (matches in shorter fields count for more).
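To make that concrete, here is a rough Python sketch of the classic TF-IDF intuition behind Lucene-style scoring. This is a simplification for illustration, not the exact formula Elasticsearch uses, and the function names are invented:

```python
import math

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms score higher.
    # Roughly mirrors classic Lucene: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf(term_freq):
    # Term frequency: more occurrences score higher, with diminishing returns.
    return math.sqrt(term_freq)

def score(term_freqs, doc_freqs, num_docs, field_len):
    # Sum tf * idf over the query terms, damped by field length
    # (matches in shorter fields count for more).
    norm = 1.0 / math.sqrt(field_len)
    return sum(tf(t) * idf(d, num_docs) * norm
               for t, d in zip(term_freqs, doc_freqs))

# Query "john doe" over 4 docs: a 2-word field matching both terms
# beats a 1-word field matching only "john".
s1 = score([1, 1], [3, 2], 4, 2)  # doc 1: "John Doe"
s3 = score([1],    [3],    4, 1)  # doc 3: "John"
```

Note how the field-length norm is also why, for the single-term query "john", the one-word field "John" can outscore "John Doe".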
The reason you are not seeing clear results is that Elasticsearch is distributed, and you are testing with small amounts of data. An index by default has 5 primary shards, and your docs are indexed on different shards. Each shard has its own doc frequency counts, so the scores are being distorted.
When you add real-world amounts of data, the frequencies even themselves out over shards, but for testing with small amounts of data, you need to do one of two things: either create the index with a single primary shard, or specify search_type=dfs_query_then_fetch, which first fetches the term frequencies from each shard, then runs the query using those global frequencies.
To demonstrate, first index your data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, search for "john doe", remembering to specify dfs_query_then_fetch:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john doe"
    }
  }
}
'
Doc 1 is the first in the results:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 8
# }
When you search for just "john":
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john"
    }
  }
}
'
Doc 3 appears first:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 1,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 0.625,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.5,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# }
# ],
# "max_score" : 1,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }
The second issue is that of matching "John Doé". This is an issue of analysis. In order to make full text searchable, we analyse it into separate terms or tokens, which are what is stored in the index. In order to match e.g. john, John and JOHN when the user searches for john, each term/token is passed through a number of token filters to put it into a standard form.
When we do a full text search, the search terms go through this exact same process. So if we have a document which contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.
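The symmetry between index-time and search-time analysis can be sketched with a toy analyzer in Python (a whitespace tokenizer plus a lowercase filter; the real standard analyzer does more):

```python
def analyze(text):
    # Toy analyzer: split on whitespace (tokenizer),
    # then lowercase each token (token filter).
    return [token.lower() for token in text.split()]

# The same analyzer runs at index time and at search time,
# so differently-cased inputs meet in the middle:
indexed = analyze("John Doe")   # ['john', 'doe']
queried = analyze("JOHN")       # ['john']
match = any(term in indexed for term in queried)
```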
In order to make Doé match doe, we need a token filter which removes accents, and we need to apply it both to the text being indexed and to the search terms. The simplest way to do this is to use the ASCII folding (asciifolding) token filter.
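The effect of accent removal can be approximated in Python with Unicode normalization (the real asciifolding filter handles many more characters, such as ligatures; this sketch only strips combining marks):

```python
import unicodedata

def fold_ascii(text):
    # Decompose each character (NFD), e.g. "é" -> "e" + combining acute,
    # then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

folded = fold_ascii("Doé".lower())  # 'doe'
```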
We can define a custom analyzer when we create an index, and we can specify in the mapping that a particular field should use that analyzer, both at index time and at search time.
First, delete the old index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
Then create the index, specifying the custom analyzer and the mapping:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"no_accents" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "no_accents"
}
}
}
}
}
'
Reindex the data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, test the search:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john doé"
    }
  }
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 6
# }
Upvotes: 5
Reputation: 1621
I think you will achieve what you need if you map the field as multiple fields, and boost the non-analyzed field:
"name": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed",
"boost": "1.1"
},
"name": {
"include_in_all": true,
"type": "string",
"index": "analyzed",
"search_analyzer": "someanalyzer",
"index_analyzer": "someanalyzer"
}
}
}
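With a mapping like this, you could also target the not_analyzed sub-field directly for exact matches, for example with a term query (a sketch reusing the index/type names from the examples above):

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
    {
      "query" : {
        "term" : {
          "name.untouched" : "John Doe"
        }
      }
    }
    '

Because the sub-field is not analyzed, the term query matches only documents whose name is exactly "John Doe".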
You could also boost at query time instead of index time if you need flexibility, by using the ^ notation in a query_string query:
{
  "query_string" : {
    "fields" : ["name", "name.untouched^5"],
    "query" : "this AND that OR thus"
  }
}
Upvotes: 2