Reputation: 699
What is the best approach for searching some keywords inside a field that contains a big text in an Elasticsearch index?
I have some words that I want to search for inside a field named my_field, with this constraint: a keyword may be written as open ai or as openai (in lowercase). I want all of these combinations to be searched, but with the exact matches prioritized in the results. Let's make an example. My words are:
cto
open
ai
So I can keep them separated or treat them as a single string "cto open ai", Google-search style. The words can also be:
cto
openai
because they come from an algorithm that extracts keywords from a text, and it may or may not split a single keyword into 2 "common" words.
The document I want as the first result has a my_field that contains a long text with: ".....cto.....open ai...". So I tried a match query, since I read that it has a fuzziness parameter to control the Levenshtein distance.
With these 2 queries the result is found:
Query ok 1 (fuzziness 0 with 3 terms): ✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
Query ok 2 (fuzziness 0 with 1 string): ✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
(The result is found even if I change the order of the words in the query.)
But I want to find the same result even when open ai is written as openai, because it's a small change/typo. So I tried:
Query error 3 (fuzziness AUTO with 2 terms and typo): ❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
But it finds other results before it. The strange thing is that if I use the same query as in case 1, but with AUTO in place of 0, it also ranks other documents first, documents that may contain only 1 of the 3 words in my_field rather than all 3. I know that one document contains all 3 words exactly, so I don't understand why it is not prioritized:
Query error 4 (fuzziness AUTO with the 3 original terms that worked before with 0): ❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
I also tried a mixed approach, giving a boost to the match clauses that do not use "fuzziness": "AUTO", but with no luck:
Query error 5 (mixed fuzziness with 2 terms and typo):❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}},
        { "match": { "my_field": { "query": "openai", "boost": 10 }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
So how can I make a query that tolerates all of these typos/little changes and still ranks first the documents that contain the possible combinations exactly?
Upvotes: 4
Views: 120
Reputation: 30153
I would index my_field twice: once as is, and a second time with an analyzer that first splits words on case changes and then joins adjacent words into bigrams using the shingle filter. At search time I would query both the original field and the bigrams field, giving the original field a higher boost.
There are different ways of doing this, depending on how many joined words you want to match, the boost level, etc., but hopefully this example will get you started:
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "tuples_index": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "token_separator": ""
        },
        "tuples_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      },
      "analyzer": {
        "standard_shingle_index": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_index" ]
        },
        "standard_shingle_search": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_search" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tuples": {
            "type": "text",
            "analyzer": "standard_shingle_index",
            "search_analyzer": "standard_shingle_search"
          }
        }
      }
    }
  }
}
PUT my_index/_bulk?refresh
{"index": {}}
{"my_field": "Mira Murati (born 1988) is a United States-based, Albanian-born engineer, researcher and business executive. She is currently the chief technology officer of OpenAI, the artificial intelligence research company that develops ChatGPT." }
{"index": {}}
{"my_field": "Women You Should Know: Mira Murati, CTO of Open A.I." }
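To see why the bigram field helps, here is a rough Python imitation of the two analyzers defined above. It is only a sketch under stated assumptions: tokenize is a simplified stand-in for the standard tokenizer plus word_delimiter and lowercase (the real filters handle many more cases), doc_1 is an excerpt of the first sample document, and all function names are made up for illustration. For the authoritative token list, run the _analyze API against the index instead.

```python
import re

def tokenize(text):
    """Simplified stand-in for standard tokenizer + word_delimiter + lowercase."""
    # Split on lower->upper case changes, e.g. "OpenAI" -> "Open AI".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Keep runs of letters/digits; punctuation such as "." or "-" separates tokens.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", spaced)]

def bigrams(tokens):
    """Join adjacent tokens with an empty separator, like token_separator: ""."""
    return [a + b for a, b in zip(tokens, tokens[1:])]

def index_terms(text):
    """Imitates tuples_index: bigrams only (output_unigrams: false)."""
    return set(bigrams(tokenize(text)))

def search_terms(text):
    """Imitates tuples_search: unigrams plus bigrams (output_unigrams: true)."""
    tokens = tokenize(text)
    return set(tokens) | set(bigrams(tokens))

# Excerpt of the first sample document, and the full second one.
doc_1 = "chief technology officer of OpenAI, the artificial intelligence research company"
doc_2 = "Women You Should Know: Mira Murati, CTO of Open A.I."

for query in ("OpenAI", "Open AI"):
    terms = search_terms(query)
    print(query, "->", sorted(terms))
    print("  overlap with doc 1:", sorted(terms & index_terms(doc_1)))
    print("  overlap with doc 2:", sorted(terms & index_terms(doc_2)))
```

In this imitation both spellings of the query produce the same search terms (open, ai, openai). Doc 1 matches through the openai bigram, and doc 2 matches through ai, which is formed by joining the a and i of "A.I.".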
GET my_index/_validate/query?explain
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "OpenAI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "OpenAI"
            }
          }
        }
      ]
    }
  }
}
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "Open AI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "Open AI"
            }
          }
        }
      ]
    }
  }
}
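As for why the fuzziness AUTO attempts in the question pull in unrelated documents, a likely reason is that fuzziness is applied to each analyzed term separately. Elasticsearch documents AUTO as allowing no edits for terms of 1 to 2 characters, one edit for 3 to 5 characters, and two edits for longer terms. The Python sketch below imitates that rule with a textbook edit-distance function (optimal string alignment, counting an adjacent transposition as one edit). It approximates what Lucene does and is not its actual implementation; the function names are made up for illustration.

```python
def auto_max_edits(term):
    """Edits allowed by fuzziness AUTO with its default thresholds (3 and 6)."""
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def edit_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each counted as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def fuzzy_matches(query_term, indexed_term):
    """True if indexed_term is within the AUTO edit budget of query_term."""
    return edit_distance(query_term, indexed_term) <= auto_max_edits(query_term)

# Hypothetical indexed terms, chosen only to illustrate the effect.
for query_term, indexed_term in [("openai", "open"), ("openai", "opens"),
                                 ("cto", "ceo"), ("cto", "cot"),
                                 ("open", "openai"), ("ai", "a")]:
    print(query_term, "~", indexed_term, fuzzy_matches(query_term, indexed_term))
```

Under these assumptions the query term openai is allowed two edits, so it reaches open but also unrelated words such as opens, while cto reaches ceo or cot, and ai must match exactly. No number of per-term edits turns the two indexed tokens open and ai into the single token openai, which is the gap the bigram field addresses.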
Upvotes: 0