designermonkey

Reputation: 1118

Understanding Analyzers, Filters and Queries in Elasticsearch

I'm trying to get my head around when I should use analyzers, filters and queries. I've read through the Search in Depth article on the elastic.co site and have a better understanding, but the examples don't map onto my use case and I'm still slightly confused.

Given I have documents with an array of ingredients, containing any mix of digestive biscuits, biscuits, cheese, and chocolate, I am trying to figure out what is the best way to analyze that data, and perform a search on it.

Here's a simple set of documents:

[{
    "ingredients": ["cheese", "chocolate"]
}, {
    "ingredients": ["chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "biscuits"]
}, {
    "ingredients": ["chocolate", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}]

(I've intentionally not mixed biscuits and digestive biscuits here; I'll explain why in a moment.)

I have one search field that will allow people to free type whatever ingredients they choose, and I currently split this out on whitespace to give me an array of terms to use.

I have the mapping as such:

{
    "properties": {
        "ingredients": {
            "type": "string",
            "analyzer": "keyword"
        }
    }
}

The problems I am facing here are that searching for biscuits does not match digestive biscuits, and biscuit does not match anything at all.

I know I have to analyze the field with something like a snowball analyzer, but I am very unsure of how to go about it.
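To illustrate why this happens, I can run a value through the _analyze API (the exact request shape may vary between Elasticsearch versions):

GET /_analyze
{
  "analyzer": "keyword",
  "text": "digestive biscuits"
}

With the keyword analyzer this should come back as a single token, digestive biscuits, which presumably is why a search for biscuits never matches it.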

Do I need a multi-field approach? Do I need to query with filters too? The results I would like to see are:

Also, if someone throws in any other random term, how do I handle that? With a filter or a query?

I'm very puzzled about how to structure this correctly, from indexing through mapping to search, so if anyone has any example advice, I'd greatly appreciate it.

Upvotes: 1

Views: 910

Answers (2)

Evaldas Buinauskas

Reputation: 14097

First of all, I'd suggest reading this: https://www.elastic.co/guide/en/elasticsearch/guide/current/stemming.html

It discusses the exact problem you're facing.

To fix this, you have to use a custom analyzer (built from character filters, a tokenizer, and token filters). An analyzer emits tokens from a blob of text.

So for your specific case, I'll show you how to create a simple custom analyzer that achieves what you want:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "kstem"
          ]
        }
      }
    }
  },
  "mappings": {
    "data": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_analyzer_custom"
        }
      }
    }
  }
}

This analyzer will split your text using the standard tokenizer and apply these filters:

  • asciifolding - normalizes accented characters (é => e)
  • lowercase - lowercases tokens, so that searches are case insensitive
  • kstem - normalizes tokens to their root forms (not perfect, but it does a good job). In this case it will normalize biscuits to biscuit
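You can verify the analyzer's output with the _analyze API (the request body shape here assumes a reasonably recent Elasticsearch; adjust it for your version):

GET /test/_analyze
{
  "analyzer": "my_analyzer_custom",
  "text": "Digestive Biscuits"
}

This should emit the tokens digestive and biscuit, so biscuits and digestive biscuits now share a common token.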

And here's your sample data:

PUT /test/data/1
{
  "ingredients": ["cheese", "chocolate"]
}
PUT /test/data/2
{
  "ingredients": ["chocolate", "biscuits"]
}
PUT /test/data/3
{
  "ingredients": ["cheese", "biscuits"]
}
PUT /test/data/4
{
  "ingredients": ["chocolate", "digestive biscuits"]
}
PUT /test/data/5
{
  "ingredients": ["cheese", "digestive biscuits"]
}
PUT /test/data/6
{
  "ingredients": ["cheese", "chocolate", "biscuits"]
}
PUT /test/data/7
{
  "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}

And this query:

GET /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.5,
      "queries": [
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "type": "phrase",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "operator": "and",
              "boost": 3
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits"
            }
          }
        }
      ]
    }
  }
}

I've used the Dis Max Query in this case. Notice the array of queries: we specify multiple subqueries there, and the document's score is driven by its best-matching subquery. From the documentation:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.

So in this case I've specified three queries:

  • Phrase Match - the query must match on both terms and their positions.
  • Match with "operator": "and" - all terms must match, regardless of their order.
  • A plain Match query - any of the tokens may match.

You can see that I'm specifying a different boost value for each of them - that's how you prioritize their relative importance.
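To illustrate the scoring (with made-up subquery scores, since real scores depend on your index statistics): if the three subqueries score 2.0, 1.5 and 1.0 for some document, dis_max takes the best score and adds tie_breaker times the rest: 2.0 + 0.7 * (1.5 + 1.0) = 3.75, which is then multiplied by the outer boost of 1.5 to give 5.625. That is how a phrase match on digestive biscuits ends up ranking above a document that only contains biscuits.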

I hope this helps.

Upvotes: 4

ChintanShah25

Reputation: 12672

This is how I would approach the problem. I created the index with the following settings:

POST food_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_possessive_stemmer",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}

  • lowercase will lowercase all the words, as the name suggests; this helps match Biscuits to biscuits
  • possessive_english removes 's from words so that we can match biscuit's to biscuit
  • light_english stems the words. This is less aggressive and uses the kstem token filter
  • asciifolding handles diacritics (I don't think you need it here, but it is up to you)
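You can sanity-check the analysis chain with the _analyze API (request shape may vary by Elasticsearch version):

GET /food_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Biscuit's"
}

This should produce the single token biscuit: lowercase strips the capital, possessive_english removes the 's, and light_english would also remove a plural ending when present.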

After that I indexed the documents you provided in the question. I think you need a simple query_string query. This will satisfy your requirements as far as scoring of documents is concerned.

{
  "query": {
    "query_string": {
      "default_field": "ingredients",
      "query": "digestive biscuits"
    }
  }
}

This gave me exactly what you asked for. Please try these settings and the query with your dataset, and let me know if you run into any issues.

I hope this helps!

Upvotes: 4
