Elasticsearch tokenizer to keep (and concatenate) "and"

I am trying to make an Elasticsearch filter, analyzer and tokenizer that normalizes searches such as "henry & william book", "henry and william book" and "henrywilliam book" so they all match each other.

In other words, I would like to normalize my "and" and "&" queries, but also concatenate the words on either side of them.

I'm thinking of making a tokenizer that breaks "henry & william book" into the tokens ["henry & william", "book"], and then a character filter that replaces the "&"/"and" plus its surrounding whitespace so the neighboring words get concatenated (e.g. "henry & william" → "henrywilliam").

However, this feels a bit hackish. Is there a better way to do it?

The reason I can't do this entirely in the analyzer/filter phase is that it runs too late. In my attempts, Elasticsearch has already broken "henry & william" into just ["henry", "william"] before my filter runs.
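
For example, this is easy to reproduce with the built-in standard analyzer (my custom attempts behave the same way, since they use the standard tokenizer):

POST _analyze
{
  "analyzer": "standard",
  "text": "henry & william book"
}

This returns the tokens henry, william and book; the "&" is already gone by the time any token filter could act on it.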

Upvotes: 2

Views: 351

Answers (2)

Val

Reputation: 217564

You can use a clever mix of two character filters that kick in before the tokenizer. The first character filter maps "and" onto "&", and the second one gets rid of the "&" and glues the two neighboring words together. This mix would also allow you to introduce other replacements, such as "or" and "|", for instance.

PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": "$1$3"
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

This yields the following results:

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}

Results =>

{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
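
If you also want this normalization applied at search time, you can reference the analyzer from a field mapping on the same test index (a minimal sketch; the title field name is just an illustration, not from the question):

PUT test/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}

Since no separate search_analyzer is specified, the same analyzer runs on both the indexed value and the query string, so "henry & william book", "henry and william book" and "henrywilliam book" all reduce to the same single token.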

Upvotes: 1

oliver_t

Reputation: 1105

All you need is a single character filter and some knowledge of regular expressions. Character filters preprocess the stream of characters before it is passed to the tokenizer.

{
    "settings": {
        "analysis": {
            "char_filter": {
                "remove_and": {
                    "type": "pattern_replace",
                    "pattern": """\s*(&|\band\b)\s*""",
                    "description": "Removes ands and ampersands"
                }
            },
            "analyzer": {
                "book-analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "remove_and"
                    ],
                    "tokenizer": "keyword"
                }
            }
        }
    }
}

Explanation:

  • \s* matches optional whitespace around the expression
  • \b adds word boundaries around the 'and', e.g. so it does not match the 'and' inside words like candy
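
Once an index has been created with these settings, the analyzer can be sanity-checked with _analyze (a quick sketch; the index name test2 is just a placeholder):

POST test2/_analyze
{
  "analyzer": "book-analyzer",
  "text": "henry & william book"
}

This should produce the single token "henrywilliam book", and "henry and william book" should yield the same result.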

Upvotes: 1
