Elasticsearch + NEST: Use a token filter only for comparison, not in the analyzer's result

Question

I want to build an analyzer in Elasticsearch that ignores the case of its input during comparison but returns case-sensitive results.

This is my current state:

My NEST code to create the analyzer:

{ "MySynonymFilter", new SynonymTokenFilter { SynonymsPath = "Path/SynonymFile.txt", Lenient = true} },

{
    "MySynonymizer", new CustomAnalyzer
    {
        Tokenizer = "whitespace",
        Filter = new List<string> {"lowercase", "MySynonymFilter"}
    }
},

This is what the analyzer created above looks like:

"MySynonymizer": {
    "filter": [
        "lowercase",
        "MySynonymFilter"
     ],
    "type": "custom",
    "tokenizer": "whitespace"
},

My synonym file ("Path/SynonymFile.txt"):

one, two, three, four => FIVE

These are the actual and desired results:

Example query:

localhost:port/index/_analyze
{
  "analyzer": "MySynonymizer",
  "text":      "Input"
}

Actual result:

Input: "one"              Output: ["five"]
Input: "One tWo THREE"    Output: ["five", "five", "five"]
Input: "one TWO foo"      Output: ["five", "five", "foo"]

Result when the lowercase filter is removed:

Input: "one"              Output: ["FIVE"]
Input: "One tWo THREE"    Output: ["One", "tWo", "THREE"]
Input: "one TWO foo"      Output: ["FIVE", "TWO", "foo"]

Desired result:

Input: "one"              Output: ["FIVE"]
Input: "One tWo THREE"    Output: ["FIVE", "FIVE", "FIVE"]
Input: "one TWO foo"      Output: ["FIVE", "FIVE", "foo"]

Hooman Bahreini · Accepted Answer

Note that the Analyze API performs analysis on your input text and returns tokens. These tokens are the analyzer's output, but they are not the final result of a search: Elasticsearch uses them to perform the actual search.

What you want could have been achieved in earlier versions of Elasticsearch, using the ignore_case parameter:
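
To illustrate the difference between analyzer tokens and what a search returns, here is a minimal Python sketch. It is a toy model, not Elasticsearch itself: the `analyze`, `build_index` and `search` functions and the sample documents are all hypothetical, and the "analyzer" is just a whitespace split followed by lowercasing. It only shows that matching happens on the normalized tokens while the stored text comes back unchanged.

```python
def analyze(text):
    """Toy analyzer: whitespace tokenizer followed by a lowercase filter."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each analyzed token to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, docs, query):
    """Analyze the query, match on tokens, return the original stored text."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return [docs[doc_id] for doc_id in sorted(hits)]


# Hypothetical sample documents, used only for this illustration.
docs = {1: "One tWo THREE", 2: "foo BAR"}
index = build_index(docs)

print(analyze("One tWo THREE"))    # ['one', 'two', 'three']
print(search(index, docs, "ONE"))  # ['One tWo THREE']
```

The tokens are lowercase, yet the hit keeps its original casing, because the tokens are only used for matching.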

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "ignore_case": "true", // <-- deprecated
                        "synonyms" : ["one, two, three => FIVE"]
                    }
                }
            }
        }
    }
}

And then you could analyze the text without using the "lowercase" token filter:

GET /test_index/_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["synonym"] ,
  "text" : "One two three" // --> result: "FIVE", "FIVE", "FIVE"
}

So your synonyms would ignore case while the analyzer would not convert anything to lowercase. However, ignore_case has been deprecated. If you try this code, you will get the following message:

Deprecation: The ignore_case option on the synonym_graph filter is deprecated. Instead, insert a lowercase filter in the filter chain before the synonym_graph filter.

What you want to achieve is no longer possible (and that makes sense). If your search is case-sensitive, then your synonyms are case-sensitive too. If you want to ignore case, use the "lowercase" token filter.
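
The two behaviours from the question can be reproduced with a small Python simulation. This is a sketch under assumptions, not Elasticsearch code: `parse_rule` and `make_analyzer` are hypothetical helpers, only single-token synonyms are handled, and the model relies on my understanding that Elasticsearch parses the synonym rules with the token filters that precede the synonym filter in the chain, which would explain why "FIVE" comes out as "five" once "lowercase" is in front of it.

```python
def parse_rule(rule, normalize):
    """Parse 'a, b => C' into ({a, b}, C), normalizing both sides."""
    left, right = rule.split("=>")
    inputs = {normalize(term.strip()) for term in left.split(",")}
    return inputs, normalize(right.strip())


def make_analyzer(rules, lowercase):
    """Build a toy whitespace analyzer with an optional lowercase filter
    followed by a single-token synonym filter."""
    normalize = str.lower if lowercase else (lambda token: token)

    # Assumption: the rules go through the same preceding filter as the text.
    mapping = {}
    for rule in rules:
        inputs, output = parse_rule(rule, normalize)
        for term in inputs:
            mapping[term] = output

    def analyze(text):
        tokens = [normalize(token) for token in text.split()]
        return [mapping.get(token, token) for token in tokens]

    return analyze


RULES = ["one, two, three, four => FIVE"]
with_lowercase = make_analyzer(RULES, lowercase=True)
without_lowercase = make_analyzer(RULES, lowercase=False)

print(with_lowercase("One tWo THREE"))   # ['five', 'five', 'five']
print(without_lowercase("one TWO foo"))  # ['FIVE', 'TWO', 'foo']
```

In this model there is no setting that matches case-insensitively and still emits "FIVE": lowercasing the input also lowercases the rule output, and skipping it makes the lookup case-sensitive.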
