bitcodr

Reputation: 1454

Elasticsearch query and sorting by parameters

How can I query and sort text in Elasticsearch by the following criteria?

1 - the search query exactly matches the beginning of the result

2 - the search query appears exactly in another part of the result

3 - the result contains all words of the search query

For example:

When I search: i love dogs

The results must come back in this order:

1 - I love dogs

2 - i love dogs and birds

3 - birds good but i love dogs and horses 

4 - Horses and i love dogs

5 - I love horses and dogs

6 - good dogs and i love horses

Upvotes: 3

Views: 1188

Answers (2)

Nikolay Vasiliev

Reputation: 6066

It is possible to achieve the desired behavior, but it requires some tweaking of both the mapping and the query.

To cut a long story short, here are the working mapping and query.

First, the mapping:

PUT my_phrase_search
{
  "mappings": {
    "doc": {
      "properties": {
        "expected_position": {
          "type": "long"
        },
        "my_phrase": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256,
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "normalizer": {
          "my_normalizer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom"
          }
        }
      }
    }
  }
}

Note: I added the field expected_position to make it easier to evaluate the results.
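
For anyone who wants to reproduce this, the six example phrases from the question could be indexed with a bulk request along these lines. This is only a sketch: it assumes the 6.x-style doc type used in the mapping above, lets ES generate the document ids, and uses ?refresh so the documents are searchable right away.

```
POST my_phrase_search/doc/_bulk?refresh
{"index":{}}
{"my_phrase":"I love dogs","expected_position":1}
{"index":{}}
{"my_phrase":"i love dogs and birds","expected_position":2}
{"index":{}}
{"my_phrase":"birds good but i love dogs and horses ","expected_position":3}
{"index":{}}
{"my_phrase":"Horses and i love dogs","expected_position":4}
{"index":{}}
{"my_phrase":"I love horses and dogs","expected_position":5}
{"index":{}}
{"my_phrase":"good dogs and i love horses","expected_position":6}
```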

Now, the query:

POST my_phrase_search/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix",
            "boost": 2
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "match"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "phrase",
            "boost": 2
          }
        }
      ]
    }
  }
}

This gives the following results:

[
  {
    "_score": 4.015718,
    "_source": {
      "my_phrase": "I love dogs",
      "expected_position": 1
    },
    "matched_queries": [
      "match",
      "phrase",
      "prefix"
    ]
  },
  {
    "_score": 3.233316,
    "_source": {
      "my_phrase": "i love dogs and birds",
      "expected_position": 2
    },
    "matched_queries": [
      "match",
      "phrase",
      "prefix"
    ]
  },
  {
    "_score": 1.3836111,
    "_source": {
      "my_phrase": "birds good but i love dogs and horses ",
      "expected_position": 3
    },
    "matched_queries": [
      "match",
      "phrase"
    ]
  },
  {
    "_score": 1.2333161,
    "_source": {
      "my_phrase": "Horses and i love dogs",
      "expected_position": 4
    },
    "matched_queries": [
      "match",
      "phrase"
    ]
  },
  {
    "_score": 0.8630463,
    "_source": {
      "my_phrase": "I love horses and dogs",
      "expected_position": 5
    },
    "matched_queries": [
      "match"
    ]
  },
  {
    "_score": 0.38110584,
    "_source": {
      "my_phrase": "good dogs and i love horses",
      "expected_position": 6
    },
    "matched_queries": [
      "match"
    ]
  }
]

You may wonder, how does it work? Are all these changes necessary? Let's find out.

What if we just use a text field and a match query?

The match query would look like this:

POST my_phrase/doc/_search
{
  "query": {
    "match": {
      "my_phrase": "i love dogs"
    }
  }
}

This will give us the following order of the results: 5 - 1 - 3 - 2 - 4 - 6.

The question is: why did the query for "i love dogs" not return the perfect match, 1 - I love dogs, as the first result? Why did 5 - I love horses and dogs come first?

In this case the answer is avgFieldLength, which is used to compute the score. It is computed per shard, so documents that live on different shards can be scored against slightly different statistics.
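
One way to check this yourself (a sketch, not part of the measurements in this answer) is to add "explain": true to the search body. The explanation returned for each hit should then include the field length statistics that went into the score; the exact labels vary between ES versions.

```
POST my_phrase/doc/_search
{
  "explain": true,
  "query": {
    "match": {
      "my_phrase": "i love dogs"
    }
  }
}
```

Comparing the explanation of doc 1 with that of doc 5 should show whether they were scored with different average field lengths.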

Clearly, we want ES to rank documents that start with our query first. How can we tell ES to prefer such documents?

Adding prefix search on keyword field

We can combine a prefix query with a match query in a bool query (whose should clauses can be roughly read as an OR in this case), like this:

POST my_phrase/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "my_phrase.keyword": "i love dogs"
          }
        },
        {
          "match": {
            "my_phrase": "i love dogs"
          }
        }
      ]
    }
  }
}

Note that the prefix query is run against the keyword sub-field, since it needs to treat the whole phrase as one big token.

This query gives us the following order of the results: 2 - 5 - 1 - 3 - 4 - 6.

2 jumped up, but 1 did not. Why?

Here the case of the characters comes into play: the keyword data type is not analyzed, so i and I are different characters for this prefix search.

How can we make the keyword field case-insensitive?

Making keyword case-insensitive

This is achieved by defining a normalizer in the mapping:

PUT my_phrase2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_phrase": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256,
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}
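
To check what the normalizer does to a value, the _analyze API can be pointed at it. This is just a quick sanity check, assuming the my_phrase2 index above has been created; it should return a single lowercased token for the whole phrase.

```
GET my_phrase2/_analyze
{
  "normalizer": "my_normalizer",
  "text": "I love dogs"
}
```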

The same bool query (prefix plus match), run against the new index, now gives us the following order: 1 - 2 - 5 - 3 - 4 - 6.

This is already pretty good, but 5 - I love horses and dogs is still too high: it ranks higher than 3 - birds good but i love dogs and horses, which contains the exact phrase.

The match query does not care about the order of the words in the phrase. Can we boost the documents that have the correct order?

Adding match_phrase to boost phrase matching

There is a match_phrase query that only matches when the tokens appear in the original order. Let's add it to the query:

POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "prefix": {
            "my_phrase.keyword": "i love dogs"
          }
        },
        {
          "match_phrase": {
            "my_phrase": "i love dogs"
          }
        },
        {
          "match": {
            "my_phrase": "i love dogs"
          }
        }
      ]
    }
  }
}

This gives us the following order: 1 - 2 - 3 - 5 - 4 - 6.

3 popped up! But 5 - I love horses and dogs is still higher than 4 - Horses and i love dogs, even though the phrase match should have favored 4.

The query has become quite complex, so let's find out which parts of it each document actually matched.

Adding names to the queries

It is possible to give names to queries, in order to see which parts of a complex query actually matched a document:

POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix"
          }
        },
...

The response for the documents of interest will give us:

  {
    "_score": 0.8630463,
    "_source": {
      "my_phrase": "I love horses and dogs",
      "expected_position": 5
    },
    "matched_queries": [
      "match"
    ]
  },
  {
    "_score": 0.82221067,
    "_source": {
      "my_phrase": "Horses and i love dogs",
      "expected_position": 4
    },
    "matched_queries": [
      "match",
      "phrase"
    ]
  },

Doc 5 did not match the phrase part, yet it still scored higher. Looks like score fluctuations hit us again.

The phrase query looks more relevant. Is there a way to boost it?

Finally: boosting the phrase and prefix queries

There is a way to influence the computation of the score by telling ES that certain parts of the query are more important. It is called boost. Here is what it might look like:

POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "prefix": {
                  "my_phrase.keyword": "i love dogs"
                }
              }
            ],
            "_name": "prefix",
            "boost": 2
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "match"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "my_phrase": "i love dogs"
                }
              }
            ],
            "_name": "phrase",
            "boost": 2
          }
        }
      ]
    }
  }
}

This one gives us the desired order of results: 1 - 2 - 3 - 4 - 5 - 6.

Note that we also boosted the prefix query, because we wanted to lower the relative importance of match.

Is this approach safe? An overfitting warning

Although this query does the job, you will want to do a great deal of real-world validation and further tweaking to make sure the search results are adequate.

A query that perfectly fits those 6 documents might not fit a large real-world collection. Please take this answer as a starting point for your own optimization.
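
One tweak that may be worth testing, and which is not part of the query above: by default the match query treats the words as an OR, so on a larger collection it can also return documents that contain only some of them. If every result must contain all words of the search query, the match query accepts an operator parameter. A minimal sketch:

```
POST my_phrase2/doc/_search
{
  "query": {
    "match": {
      "my_phrase": {
        "query": "i love dogs",
        "operator": "and"
      }
    }
  }
}
```

The same object form can replace the match clause inside the bool query. I have not measured how this affects the ordering of the six example documents.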

As you can see, not all parts of the query are necessary: the query names can easily be omitted, but they are a good aid in understanding how a document was matched.
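
If the extra bool wrappers feel heavy, _name and boost can also be set directly on the leaf queries. The following is a sketch of a more compact form of the final query; I would expect it to behave the same way, but it is worth verifying against your own data.

```
POST my_phrase2/doc/_search
{
  "query": {
    "bool": {
      "should": [
        { "prefix": { "my_phrase.keyword": { "value": "i love dogs", "boost": 2, "_name": "prefix" } } },
        { "match": { "my_phrase": { "query": "i love dogs", "_name": "match" } } },
        { "match_phrase": { "my_phrase": { "query": "i love dogs", "boost": 2, "_name": "phrase" } } }
      ]
    }
  }
}
```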

Upvotes: 3

Muhammad Zubair Saleem

Reputation: 517

To get your desired results, you need to use match_phrase_prefix with parameters like max_expansions. An example is below; for further reading, see:

match_phrase_prefix

GET /_search
{
    "query": {
        "match_phrase_prefix" : {
            "message" : "quick brown f"
        }
    }
}
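
Applied to the question, a request with max_expansions set might look like the sketch below. The index and field names are borrowed from the other answer and the value 10 is only an illustrative choice. Note that match_phrase_prefix only matches documents that contain the phrase itself (with the last word treated as a prefix), so documents that merely contain all the words in a different order would not be returned.

```
POST my_phrase2/doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "my_phrase": {
        "query": "i love dogs",
        "max_expansions": 10
      }
    }
  }
}
```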

Upvotes: 0
