Elasticsearch tokenizer to keep (and concatenate) "and"

I am trying to make an Elasticsearch filter, analyzer and tokenizer that normalizes searches such as "henry & william book", "henry and william book" and "henrywilliam book" so they all match each other.

In other words, I would like to normalize my "and" and "&" queries, but also concatenate the words on either side of them.

I'm thinking of making a tokenizer that breaks "henry & william book" into the tokens ["henry & william", "book"], and then a character filter that replaces the "&"/"and" plus its surrounding whitespace so the neighboring words get concatenated (e.g. "henry & william" → "henrywilliam").

However, this feels a bit hackish. Is there a better way to do it?

The reason I can't do this entirely in the analyzer/filter phase is that it runs too late. In my attempts, Elasticsearch has already broken "henry & william" into just ["henry", "william"] before my filter runs.
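
For example, this is easy to reproduce with the built-in standard analyzer (my custom attempts behave the same way, since they use the standard tokenizer):

POST _analyze
{
  "analyzer": "standard",
  "text": "henry & william book"
}

This returns the tokens henry, william and book; the "&" is already gone by the time any token filter could act on it.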

Upvotes: 2

Views: 351

Answers (2)

Val

Reputation: 217564

You can use a clever mix of two character filters that kick in before the tokenizer. The first character filter maps "and" onto "&", and the second one gets rid of the "&" and glues the two neighboring words together. This mix would also allow you to introduce other replacements, such as "or" and "|", for instance.

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": "$1$3"
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

This yields the following results:

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
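
If you also want this normalization applied at search time, you can reference the analyzer from a field mapping on the same test index (a minimal sketch; the title field name is just an illustration, not from the question):

PUT test/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}

Since no separate search_analyzer is specified, the same analyzer runs on both the indexed value and the query string, so "henry & william book", "henry and william book" and "henrywilliam book" all reduce to the same single token.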

Upvotes: 1

oliver_t

Reputation: 1105

All you need is a single character filter and some knowledge of regular expressions. Character filters preprocess the stream of characters before it is passed to the tokenizer.

{
    "settings": {
        "analysis": {
            "char_filter": {
                "remove_and": {
                    "type": "pattern_replace",
                    "pattern": """\s*(&|\band\b)\s*""",
                    "description": "Removes ands and ampersands"
                }
            },
            "analyzer": {
                "book-analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "remove_and"
                    ],
                    "tokenizer": "keyword"
                }
            }
        }
    }
}

Explanation:

  • \s* matches optional whitespace around the expression
  • \b adds word boundaries around the 'and', e.g. so it does not match the 'and' inside words like candy
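
Once an index has been created with these settings, the analyzer can be sanity-checked with _analyze (a quick sketch; the index name test2 is just a placeholder):

POST test2/_analyze
{
  "analyzer": "book-analyzer",
  "text": "henry & william book"
}

This should produce the single token "henrywilliam book", and "henry and william book" should yield the same result.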

Upvotes: 1
