designermonkey

Reputation: 1118

Understanding Analyzers, Filters and Queries in Elasticsearch

I'm trying to get my head around when I should use analyzers, filters and queries. I've read through the Search in Depth article on the elastic.co site and have a better understanding, but the examples don't map onto my use case and I'm still slightly confused.

Given I have documents with an array of ingredients, containing any mix of digestive biscuits, biscuits, cheese, and chocolate, I am trying to figure out what is the best way to analyze that data, and perform a search on it.

Here's a simple set of documents:

[{
    "ingredients": ["cheese", "chocolate"]
}, {
    "ingredients": ["chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "biscuits"]
}, {
    "ingredients": ["chocolate", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}]

(I've intentionally not mixed biscuits and digestive biscuits here; I'll explain why in a moment.)

I have one search field that will allow people to free type whatever ingredients they choose, and I currently split this out on whitespace to give me an array of terms to use.

I have the mapping as such:

{
    "properties": {
        "ingredients": {
            "type": "string",
            "analyzer": "keyword"
        }
    }
}

The problems I am facing here are that searching for biscuits does not match digestive biscuits, and biscuit does not match anything at all.

I know I have to analyze the field with something like a snowball analyzer, but I am very unsure of how to go about it.
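To illustrate why this happens, I can run a value through the _analyze API (the exact request shape may vary between Elasticsearch versions):

GET /_analyze
{
  "analyzer": "keyword",
  "text": "digestive biscuits"
}

With the keyword analyzer this should come back as a single token, digestive biscuits, which presumably is why a search for biscuits never matches it.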

Do I need a multi-field approach? Do I need to query with filters too? The results I would like to see are:

Also, if someone throws in any other random term, how do I handle that? With a filter or a query?

I'm very puzzled about how to structure this correctly, from indexing through mapping to search, so if anyone has any example advice, I'd greatly appreciate it.

Upvotes: 1

Views: 910

Answers (2)

Evaldas Buinauskas

Reputation: 14097

First of all, I'd suggest reading this: https://www.elastic.co/guide/en/elasticsearch/guide/current/stemming.html

It discusses the exact problem you're facing.

To fix this, you have to use a custom analyzer (built from character filters, a tokenizer, and token filters). An analyzer emits tokens from a blob of text.

So for your specific case, I'll show you how to create a simple custom analyzer that achieves what you want:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "kstem"
          ]
        }
      }
    }
  },
  "mappings": {
    "data": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_analyzer_custom"
        }
      }
    }
  }
}

This analyzer will split your text using the standard tokenizer and apply these filters:

  • asciifolding - normalizes accented characters (é => e)
  • lowercase - lowercases tokens, so that searches are case insensitive
  • kstem - normalizes tokens to their root forms (not perfect, but it does a good job). In this case it will normalize biscuits to biscuit
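You can verify the analyzer's output with the _analyze API (the request body shape here assumes a reasonably recent Elasticsearch; adjust it for your version):

GET /test/_analyze
{
  "analyzer": "my_analyzer_custom",
  "text": "Digestive Biscuits"
}

This should emit the tokens digestive and biscuit, so biscuits and digestive biscuits now share a common token.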

And here's your sample data:

PUT /test/data/1
{
  "ingredients": ["cheese", "chocolate"]
}
PUT /test/data/2
{
  "ingredients": ["chocolate", "biscuits"]
}
PUT /test/data/3
{
  "ingredients": ["cheese", "biscuits"]
}
PUT /test/data/4
{
  "ingredients": ["chocolate", "digestive biscuits"]
}
PUT /test/data/5
{
  "ingredients": ["cheese", "digestive biscuits"]
}
PUT /test/data/6
{
  "ingredients": ["cheese", "chocolate", "biscuits"]
}
PUT /test/data/7
{
  "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}

And this query:

GET /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.5,
      "queries": [
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "type": "phrase",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "operator": "and",
              "boost": 3
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits"
            }
          }
        }
      ]
    }
  }
}

I've used the Dis Max Query in this case. Notice the array of queries: we specify multiple subqueries there, and the document's score is driven by its best-matching subquery. From the documentation:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.

So in this case I've specified three queries:

  • Phrase Match - the query must match on both terms and their positions.
  • Match with "operator": "and" - all terms must match, regardless of their order.
  • A plain Match query - any of the tokens may match.

You can see that I'm specifying a different boost value for each of them - that's how you prioritize their relative importance.
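To illustrate the scoring (with made-up subquery scores, since real scores depend on your index statistics): if the three subqueries score 2.0, 1.5 and 1.0 for some document, dis_max takes the best score and adds tie_breaker times the rest: 2.0 + 0.7 * (1.5 + 1.0) = 3.75, which is then multiplied by the outer boost of 1.5 to give 5.625. That is how a phrase match on digestive biscuits ends up ranking above a document that only contains biscuits.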

I hope this helps.

Upvotes: 4

ChintanShah25

Reputation: 12672

This is how I would approach the problem. I created the index with the following settings:

POST food_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_possessive_stemmer",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}

  • lowercase will lowercase all the words, as the name suggests; this helps match Biscuits to biscuits
  • possessive_english removes 's from words so that we can match biscuit's to biscuit
  • light_english stems the words. This is less aggressive and uses the kstem token filter
  • asciifolding handles diacritics (I don't think you need it here, but it is up to you)
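You can sanity-check the analysis chain with the _analyze API (request shape may vary by Elasticsearch version):

GET /food_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Biscuit's"
}

This should produce the single token biscuit: lowercase strips the capital, possessive_english removes the 's, and light_english would also remove a plural ending when present.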

After that I indexed the documents you provided in the question. I think you need a simple query_string query. This will satisfy your requirements as far as scoring of documents is concerned.

{
  "query": {
    "query_string": {
      "default_field": "ingredients",
      "query": "digestive biscuits"
    }
  }
}

This gave me exactly what you asked for. Please try these settings and the query with your dataset, and let me know if you run into any issues.

I hope this helps!

Upvotes: 4
