Shawn

Reputation: 331

How to make Elasticsearch more flexible?

I am currently using this elasticsearch DSL query:

{
    "_source": [
        "title",
        "bench",
        "id_",
        "court",
        "date"
    ],
    "size": 15,
    "from": 0,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "i r coelho",
                    "fields": [
                        "title",
                        "content"
                    ]
                }
            },
            "filter": [],
            "should": {
                "multi_match": {
                    "query": "i r coelho",
                    "fields": [
                        "title.standard^16",
                        "content.standard"
                    ]
                }
            }
        }
    },
    "highlight": {
        "pre_tags": [
            "<tag1>"
        ],
        "post_tags": [
            "</tag1>"
        ],
        "fields": {
            "content": {}
        }
    }
}

Here's what's happening: if I search for I.r coelho, it returns the correct results. But if I search for I R coelho (without the period), it returns a different result. How do I prevent this from happening? I want the search to behave the same even if there are extra periods, spaces, commas, etc.

Mapping

{
    "courts_2": {
        "mappings": {
            "properties": {
                "author": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                },
                "bench": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                },
                "citation": {
                    "type": "text"
                },
                "content": {
                    "type": "text",
                    "fields": {
                        "standard": {
                            "type": "text"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "court": {
                    "type": "text"
                },
                "date": {
                    "type": "text"
                },
                "id_": {
                    "type": "text"
                },
                "title": {
                    "type": "text",
                    "fields": {
                        "standard": {
                            "type": "text"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "verdict": {
                    "type": "text"
                }
            }
        }
    }
}

Settings:

{
    "courts_2": {
        "settings": {
            "index": {
                "highlight": {
                    "max_analyzed_offset": "19000000"
                },
                "number_of_shards": "5",
                "provided_name": "courts_2",
                "creation_date": "1581094116992",
                "analysis": {
                    "filter": {
                        "my_metaphone": {
                            "replace": "true",
                            "type": "phonetic",
                            "encoder": "metaphone"
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "filter": [
                                "lowercase",
                                "my_metaphone"
                            ],
                            "tokenizer": "standard"
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "MZSecLIVQy6jiI6YmqOGLg",
                "version": {
                    "created": "7010199"
                }
            }
        }
    }
}

EDIT: Here are the results for I.R coelho from my_analyzer:

{
    "tokens": [
        {
            "token": "IR",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "KLH",
            "start_offset": 4,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Standard analyzer:

{
    "tokens": [
        {
            "token": "i.r",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "coelho",
            "start_offset": 4,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Upvotes: 0

Views: 493

Answers (1)

glenacota

Reputation: 2547

The reason you get different behaviour when searching for I.r coelho and I R coelho is that you are using different analyzers on the same fields: my_analyzer for title and content (the must block), and standard (the default) for title.standard and content.standard (the should block).

The two analyzers generate different tokens, and therefore contribute different scores, depending on whether you search for I.r coelho (e.g., 2 tokens with the standard analyzer) or I R coelho (e.g., 3 tokens with the standard analyzer). You can test the behaviour of your analyzers with the _analyze API (see the Elastic documentation).
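
For example, a pair of _analyze requests of this shape (shown in Kibana Dev Tools syntax, run against your courts_2 index) lets you compare what each analyzer does with the same text:

GET courts_2/_analyze
{
    "analyzer": "my_analyzer",
    "text": "I R coelho"
}

GET courts_2/_analyze
{
    "field": "title.standard",
    "text": "I R coelho"
}

The first request uses your custom phonetic analyzer directly; the second resolves the analyzer from the mapping of the title.standard subfield, which in your case is the standard analyzer.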

You have to decide whether this is your desired behaviour.

Updates (after requested clarifications from OP)

The results of the _analyze calls confirm the hypothesis: the two analyzers make different score contributions, and consequently return different results, depending on whether or not your query includes symbol characters.

If you don't want the results of your query to be affected by symbols such as dots, or by letter case, you will need to reconsider which analyzers to apply; the ones currently in use will never satisfy that requirement. If I understood your requirements correctly, the built-in simple analyzer should be the right one for your use case.
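
To illustrate, the simple analyzer lowercases and splits on any non-letter character, so both variants of your query collapse to the same tokens. You can verify this with two quick _analyze calls (no index needed, since simple is a built-in analyzer):

GET _analyze
{
    "analyzer": "simple",
    "text": "I.R coelho"
}

GET _analyze
{
    "analyzer": "simple",
    "text": "I R coelho"
}

Both requests return the tokens i, r and coelho, so the presence or absence of the period no longer changes what is searched.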

In a nutshell: (1) you should consider replacing the built-in standard analyzer with the simple one, and (2) you should decide whether you want your query to score hits using two different analyzers (i.e., the custom phonetic one on the title and content fields, and the simple one on their respective subfields).
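
As a sketch only (the index name courts_2_v2 is hypothetical, and changing the analyzer of an existing field requires creating a new index and reindexing into it), the relevant part of such a mapping could look like this, keeping your phonetic analyzer on the parent fields and applying simple to the .standard subfields:

PUT courts_2_v2
{
    "settings": {
        "analysis": {
            "filter": {
                "my_metaphone": {
                    "type": "phonetic",
                    "encoder": "metaphone",
                    "replace": "true"
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_metaphone"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "standard": {
                        "type": "text",
                        "analyzer": "simple"
                    }
                }
            },
            "content": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "standard": {
                        "type": "text",
                        "analyzer": "simple"
                    }
                }
            }
        }
    }
}

Only the title and content fields are shown here; the remaining fields from your current mapping would carry over unchanged, and a _reindex from courts_2 into the new index would be needed afterwards.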

Upvotes: 1
