Reputation: 1276
I need to query exactly against a set of "short documents". Example:
Documents:
Expected results:
Is it possible with ES? How can I achieve this? I tried boosting "name", but I can't find how to match the document field exactly, instead of searching inside of it.
Upvotes: 2
Views: 2099
Reputation: 17319
What you are describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document, based on factors like how often the term appears in that document (term frequency), how rare the term is across all documents (inverse document frequency), and how short the field is (matches in shorter fields count for more).
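To make that concrete, here is a rough Python sketch of the classic TF-IDF intuition behind Lucene-style scoring. This is a simplification for illustration, not the exact formula Elasticsearch uses, and the function names are invented:

```python
import math

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms score higher.
    # Roughly mirrors classic Lucene: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf(term_freq):
    # Term frequency: more occurrences score higher, with diminishing returns.
    return math.sqrt(term_freq)

def score(term_freqs, doc_freqs, num_docs, field_len):
    # Sum tf * idf over the query terms, damped by field length
    # (matches in shorter fields count for more).
    norm = 1.0 / math.sqrt(field_len)
    return sum(tf(t) * idf(d, num_docs) * norm
               for t, d in zip(term_freqs, doc_freqs))

# Query "john doe" over 4 docs: a 2-word field matching both terms
# beats a 1-word field matching only "john".
s1 = score([1, 1], [3, 2], 4, 2)  # doc 1: "John Doe"
s3 = score([1],    [3],    4, 1)  # doc 3: "John"
```

Note how the field-length norm is also why, for the single-term query "john", the one-word field "John" can outscore "John Doe".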
The reason you are not seeing clear results is that Elasticsearch is distributed, and you are testing with small amounts of data. An index by default has 5 primary shards, and your docs are indexed on different shards. Each shard has its own doc frequency counts, so the scores are being distorted.
When you add real-world amounts of data, the frequencies even themselves out over shards, but for testing with small amounts of data, you need to do one of two things: either create the index with a single primary shard, or specify search_type=dfs_query_then_fetch, which first fetches the term frequencies from each shard, then runs the query using those global frequencies.
To demonstrate, first index your data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, search for "john doe", remembering to specify dfs_query_then_fetch:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john doe"
    }
  }
}
'
Doc 1 is the first in the results:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 8
# }
When you search for just "john":
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john"
    }
  }
}
'
Doc 3 appears first:
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 1,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 0.625,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.5,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# }
# ],
# "max_score" : 1,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 5
# }
The second issue is that of matching "John Doé". This is an issue of analysis. In order to make full text searchable, we analyse it into separate terms or tokens, which are what is stored in the index. In order to match e.g. john, John and JOHN when the user searches for john, each term/token is passed through a number of token filters to put it into a standard form.
When we do a full text search, the search terms go through this exact same process. So if we have a document which contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.
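The symmetry between index-time and search-time analysis can be sketched with a toy analyzer in Python (a whitespace tokenizer plus a lowercase filter; the real standard analyzer does more):

```python
def analyze(text):
    # Toy analyzer: split on whitespace (tokenizer),
    # then lowercase each token (token filter).
    return [token.lower() for token in text.split()]

# The same analyzer runs at index time and at search time,
# so differently-cased inputs meet in the middle:
indexed = analyze("John Doe")   # ['john', 'doe']
queried = analyze("JOHN")       # ['john']
match = any(term in indexed for term in queried)
```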
In order to make Doé match doe, we need a token filter which removes accents, and we need to apply it both to the text being indexed and to the search terms. The simplest way to do this is to use the ASCII folding (asciifolding) token filter.
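The effect of accent removal can be approximated in Python with Unicode normalization (the real asciifolding filter handles many more characters, such as ligatures; this sketch only strips combining marks):

```python
import unicodedata

def fold_ascii(text):
    # Decompose each character (NFD), e.g. "é" -> "e" + combining acute,
    # then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

folded = fold_ascii("Doé".lower())  # 'doe'
```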
We can define a custom analyzer when we create an index, and we can specify in the mapping that a particular field should use that analyzer, both at index time and at search time.
First, delete the old index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
Then create the index, specifying the custom analyzer and the mapping:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"no_accents" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"type" : "custom",
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"name" : {
"type" : "string",
"analyzer" : "no_accents"
}
}
}
}
}
'
Reindex the data:
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '
{
"alt" : "John W Doe",
"name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{
"alt" : "John A Doe",
"name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1' -d '
{
"alt" : "Susy",
"name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1' -d '
{
"alt" : "John Doe",
"name" : "Jack"
}
'
Now, test the search:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch' -d '
{
  "query" : {
    "match" : {
      "name" : "john doé"
    }
  }
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "alt" : "John W Doe",
# "name" : "John Doe"
# },
# "_score" : 1.0189849,
# "_index" : "test",
# "_id" : "1",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "John A Doe",
# "name" : "My friend John Doe"
# },
# "_score" : 0.81518793,
# "_index" : "test",
# "_id" : "2",
# "_type" : "test"
# },
# {
# "_source" : {
# "alt" : "Susy",
# "name" : "John"
# },
# "_score" : 0.3066778,
# "_index" : "test",
# "_id" : "3",
# "_type" : "test"
# }
# ],
# "max_score" : 1.0189849,
# "total" : 3
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 6
# }
Upvotes: 5
Reputation: 1621
I think you will achieve what you need if you map the field as multiple fields, and boost the non-analyzed field:
"name": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed",
"boost": "1.1"
},
"name": {
"include_in_all": true,
"type": "string",
"index": "analyzed",
"search_analyzer": "someanalyzer",
"index_analyzer": "someanalyzer"
}
}
}
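With a mapping like this, you could also target the not_analyzed sub-field directly for exact matches, for example with a term query (a sketch reusing the index/type names from the examples above):

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
    {
      "query" : {
        "term" : {
          "name.untouched" : "John Doe"
        }
      }
    }
    '

Because the sub-field is not analyzed, the term query matches only documents whose name is exactly "John Doe".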
You could also boost at query time instead of index time if you need flexibility, by using the ^ notation in a query_string query:
{
  "query_string" : {
    "fields" : ["name", "name.untouched^5"],
    "query" : "this AND that OR thus"
  }
}
Upvotes: 2