Reputation: 1770
Say I've indexed a document in Elastic with these two keyword fields:
"lastModified": "02/03/2020"
"theText": "Hello there"
I want to support case-insensitive substring searches against both fields. The doc should match when I search "lastModified" with any of these query strings:
"02"
"2/03"
"/0"
"2020"
And the doc should match when I search "theText" for any of these (note case changes):
"helLO"
"lo there"
"the"
You get the idea. I just need a simple case-insensitive substring search. No fuzziness or anything fancy. I've tried wildcards, regexes, escaping the slashes for "lastModified", and remapping / to _slash_, and I am stuck. Wildcards work except when there's a slash. How can I get the wildcard approach to work with slashes? Or is there a better way?
I'd prefer to avoid going the N-Gram route, since the text data could be a very long paragraph and it would create many grams :).
To summarize, my preferred solution would match case-insensitive substrings on both fields, handle slashes, and avoid n-grams.
For now I'm using an ugly Regex against a Keyword field. It works but feels pretty silly.
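For reference, the ugly workaround looks roughly like this (just a sketch; the pattern is illustrative, and the case_insensitive flag needs Elasticsearch 7.10+, otherwise I lowercase both sides myself):
{
  "query": {
    "regexp": {
      "theText.keyword": {
        "value": ".*lo there.*",
        "case_insensitive": true
      }
    }
  }
}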
Upvotes: 0
Views: 879
Reputation: 16192
You need to use an n-gram tokenizer for a substring match. Since you also want to keep /, you need to add punctuation to the token_chars as well.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "punctuation"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "lastModified": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "theText": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
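If you want to see exactly which grams the custom analyzer emits, you can test it with the _analyze API (a quick sketch; 67825121 is the index name used in the results below, replace it with your own):
POST /67825121/_analyze
{
  "analyzer": "my_analyzer",
  "text": "02/03/2020"
}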
Index Data:
{
  "theText": "Hello there"
}
{
  "lastModified": "02/03/2020"
}
Search Query on field 1:
{
  "query": {
    "match": {
      "theText": "helLO"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "2",
"_score": 0.970927,
"_source": {
"theText": "Hello there"
}
}
]
Search Query on field 2:
{
  "query": {
    "match": {
      "lastModified": "2/03"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "1",
"_score": 2.0497348,
"_source": {
"lastModified": "02/03/2020"
}
}
]
I have tried other queries as well, and they return the correct results for your use case.
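For example (my own addition, not part of the test run above), even the punctuation-only fragment "/0" from your list should match, because the tokenizer emits grams that span the slash:
{
  "query": {
    "match": {
      "lastModified": "/0"
    }
  }
}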
Update 1:
Elasticsearch uses the standard analyzer if no analyzer is specified. Assuming that the lastModified and theText fields are of text type, "02/03/2020" will get tokenized into:
{
  "tokens": [
    {
      "token": "02",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "03",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "2020",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 2
    }
  ]
}
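You can reproduce this token list yourself with the _analyze API, this time against the standard analyzer and without any index:
POST /_analyze
{
  "analyzer": "standard",
  "text": "02/03/2020"
}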
Now, when you run a wildcard query on either of the above fields, it searches against the tokens shown above. Since no token matches "2/03", you get empty results for that query.
It is better to use a keyword field for wildcard queries. If you have not explicitly defined an index mapping, you can add .keyword to both fields. This uses the keyword analyzer instead of the standard analyzer, so the whole value is stored as a single token (notice the ".keyword" after the field names in the queries below).
Search Query:
{
  "query": {
    "wildcard": {
      "lastModified.keyword": {
        "value": "*2/03*"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"lastModified": "02/03/2020"
}
}
]
Search Query:
{
  "query": {
    "wildcard": {
      "theText.keyword": {
        "value": "*lo there*"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"theText": "Hello there"
}
}
]
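One caveat (my addition, beyond the examples above): wildcard queries against a keyword field are case-sensitive by default, so "*helLO*" would not match "Hello there". On Elasticsearch 7.10+ you can set the case_insensitive flag, roughly like this:
{
  "query": {
    "wildcard": {
      "theText.keyword": {
        "value": "*helLO*",
        "case_insensitive": true
      }
    }
  }
}
On older versions you would need to lowercase the query string yourself and keep a lowercased copy of the field, or stick with the n-gram approach above, which handles case via the lowercase filter.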
Upvotes: 1