Reputation: 1118
I'm trying to get my head around when I should be using analyzers, filters and queries. I've read through the Search in Depth article on the elastic.co site and have a better understanding, but the examples are too simple for my use case, and I'm still slightly confused.
Given I have documents with an array of ingredients, containing any mix of "digestive biscuits", "biscuits", "cheese", and "chocolate", I am trying to figure out the best way to analyze that data and perform a search on it.
Here's a simple set of documents:
[{
"ingredients": ["cheese", "chocolate"]
}, {
"ingredients": ["chocolate", "biscuits"]
}, {
"ingredients": ["cheese", "biscuits"]
}, {
"ingredients": ["chocolate", "digestive biscuits"]
}, {
"ingredients": ["cheese", "digestive biscuits"]
}, {
"ingredients": ["cheese", "chocolate", "biscuits"]
}, {
"ingredients": ["cheese", "chocolate", "digestive biscuits"]
}]
(I've intentionally not mixed "biscuits" and "digestive biscuits" here, I'll explain in a mo.)
I have one search field that will allow people to free type whatever ingredients they choose, and I currently split this out on whitespace to give me an array of terms to use.
I have the mapping as such:
{
  "properties": {
    "ingredients": {
      "type": "string",
      "analyzer": "keyword"
    }
  }
}
The problems I am facing here are that "biscuits" does not match "digestive biscuits", and "biscuit" does not match anything.
I know I have to analyze the field with a snowball analyzer, but I am very unsure how to do this.
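As a rough illustration of why the keyword analyzer behaves this way: it indexes each array value as one single token, so only a query for the complete string can match. The Python sketch below is a toy model of that behaviour, not Elasticsearch itself, and `keyword_tokens` and `term_matches` are hypothetical helper names.

```python
def keyword_tokens(values):
    """Toy model of the keyword analyzer: each value becomes exactly one token."""
    return set(values)


def term_matches(term, values):
    """A term matches only if it equals one of the indexed tokens exactly."""
    return term in keyword_tokens(values)


doc = ["chocolate", "digestive biscuits"]

print(term_matches("digestive biscuits", doc))           # the whole value
print(term_matches("biscuits", doc))                     # only part of a value
print(term_matches("biscuit", ["cheese", "biscuits"]))   # singular form
```

Only the first lookup succeeds in this model, which mirrors the two problems described above.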
Do I need a multi-field approach? Do I need to query with filters too? The results I would like to see are:
- "biscuit" matching both "biscuits" and "digestive biscuits" (the latter being scored lower)
- "biscuits" matching both "biscuits" and "digestive biscuits" (the latter being scored lower)
- "digestive" matching "digestive biscuits"
- "digestive biscuits" matching itself and "biscuits" (the latter being scored lower)
Also, throwing any other term in randomly, how do I handle that? Use a filter or a query?
Very puzzled by how to structure this right from index through mapping and search, so if anyone has any example advice, I'd greatly appreciate it.
Upvotes: 1
Views: 910
Reputation: 14097
First of all, I'd suggest reading this: https://www.elastic.co/guide/en/elasticsearch/guide/current/stemming.html
It discusses the exact problem you're facing.
So to fix this, you have to use a custom analyzer (it's built from character filters, a tokenizer and token filters). An analyzer emits tokens from a blob of text.
So in your specific case, I'll show you how to create a simple custom analyzer to achieve what you want:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "kstem"
          ]
        }
      }
    }
  },
  "mappings": {
    "data": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_analyzer_custom"
        }
      }
    }
  }
}
This analyzer will split your text using the standard tokenizer and apply these filters:
- asciifolding - normalizes letters with accent characters (é => e)
- lowercase - lowercases tokens, so that searches are case insensitive
- kstem - a filter that normalizes tokens to their root forms (not ideal, but does a good job). In this case it's going to normalize "biscuits" into "biscuit"
So here's your sample data:
PUT /test/data/1
{
"ingredients": ["cheese", "chocolate"]
}
PUT /test/data/2
{
"ingredients": ["chocolate", "biscuits"]
}
PUT /test/data/3
{
"ingredients": ["cheese", "biscuits"]
}
PUT /test/data/4
{
"ingredients": ["chocolate", "digestive biscuits"]
}
PUT /test/data/5
{
"ingredients": ["cheese", "digestive biscuits"]
}
PUT /test/data/6
{
"ingredients": ["cheese", "chocolate", "biscuits"]
}
PUT /test/data/7
{
"ingredients": ["cheese", "chocolate", "digestive biscuits"]
}
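To get a feel for what this analyzer does to the sample data, here is a rough Python sketch of the same three steps. It is only an approximation: the regex split stands in for the standard tokenizer, and `naive_stem` is a hypothetical plural-stripping rule standing in for kstem, which is considerably smarter.

```python
import re
import unicodedata


def ascii_fold(token):
    """Approximate asciifolding: decompose accented letters, drop the accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Hypothetical stand-in for kstem: strip a plain trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Split on word characters, then fold, lowercase and stem each token."""
    tokens = re.findall(r"\w+", text)
    return [naive_stem(ascii_fold(t).lower()) for t in tokens]


for value in ["cheese", "biscuits", "digestive biscuits", "Caf\u00e9"]:
    print(value, "->", analyze(value))
```

With tokens like these, a search for "biscuit" or "biscuits" is reduced to the same token and can match both the "biscuits" and the "digestive biscuits" values.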
And this query:
GET /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.5,
      "queries": [
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "type": "phrase",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "operator": "and",
              "boost": 3
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits"
            }
          }
        }
      ]
    }
  }
}
I've used a Dis Max query in this case. You see that there's an array of queries? We're specifying multiple queries there, and each document is scored by its best-matching query (plus a tie-breaker contribution from the other queries that match). From the documentation:
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
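The scoring rule in that quote can be written out as a small Python sketch. The function name and the sample scores are made up for illustration; real subquery scores come from Elasticsearch's relevance calculation, and boosts are left out here.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    """Best subquery score plus tie_breaker times the other matching scores."""
    matching = [s for s in subquery_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    return best + tie_breaker * others


# Hypothetical scores one document received from three subqueries.
scores = [2.0, 1.0, 0.5]

print(dis_max_score(scores))                   # tie_breaker 0: only the best score counts
print(dis_max_score(scores, tie_breaker=0.7))  # the others add 70% of their scores
print(dis_max_score(scores, tie_breaker=1.0))  # equivalent to summing all three
```

A tie_breaker of 0.7, as in the query above, therefore rewards documents that match several of the subqueries without letting the weaker matches dominate.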
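So in this case I've specified three queries:
- a phrase match (boost 5): the terms must appear next to each other, in the given order
- a match with "operator": "and" (boost 3): it means that all terms must match, regardless of their order
- a plain match with the default "or" operator: any of the terms may match
You can see that for each of them I'm specifying different boost values - that's how you prioritize their importance.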
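The difference between the three subqueries can be sketched in Python on already-analyzed tokens. This is a simplified model with hypothetical helper names: it ignores scoring and treats a field value as one flat list of tokens.

```python
def matches_phrase(doc_tokens, query_tokens):
    """Phrase match: query tokens must appear adjacent and in the same order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def matches_and(doc_tokens, query_tokens):
    """Operator 'and': every query token must be present, in any order."""
    return all(t in doc_tokens for t in query_tokens)


def matches_or(doc_tokens, query_tokens):
    """Default operator 'or': at least one query token must be present."""
    return any(t in doc_tokens for t in query_tokens)


query = ["digestive", "biscuit"]
docs = {
    "in order": ["digestive", "biscuit"],
    "reversed": ["biscuit", "digestive"],  # hypothetical value, for contrast
    "one term": ["biscuit"],
}

for name, tokens in docs.items():
    print(name, matches_phrase(tokens, query),
          matches_and(tokens, query), matches_or(tokens, query))
```

The more specific a subquery is, the fewer documents it matches, which is why it makes sense to give the phrase query the highest boost.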
I hope this helps.
Upvotes: 4
Reputation: 12672
This is how I would approach this problem. I created the index with the following settings:
POST food_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_possessive_stemmer",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
The english_possessive_stemmer filter removes 's from words so that we can match "biscuit's" to "biscuit".
After that I inserted the documents you provided in the question. I think you need a simple query_string query. This will satisfy all your requirements as far as scoring of documents is concerned.
{
  "query": {
    "query_string": {
      "default_field": "ingredients",
      "query": "digestive biscuits"
    }
  }
}
This gave me exactly what you asked for. Please try these settings and query with your dataset and let me know if you face any issues.
I hope this helps!
Upvotes: 4