Reputation: 14077
Consider the following index settings as an example:
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        }
      },
      "analyzer": {
        "my_test": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["my_hunspell"]
        }
      }
    }
  }
}
I've downloaded the hunspell dictionaries from the official Mozilla page.
Now the issue is that some words, for instance beer, are over-stemmed. The following query transforms beer into bee, which is not correct:
POST /test/_analyze?analyzer=my_test&text=beer
{
  "tokens": [
    {
      "token": "bee",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
Hunspell syntax is quite hard to understand. What can be done to avoid this behaviour? Is it possible to preserve certain words or to add a rule?
Upvotes: 0
Views: 250
Reputation: 1166
If you can come up with a list of words to preserve, then the Keyword Marker Token Filter might be worth looking into. It prevents the words you want to protect from being stemmed. Your updated analyzer might look something like:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell": {
          "type": "hunspell",
          "language": "en_GB"
        },
        "protect_my_words": {
          "type": "keyword_marker",
          "keywords_path": "<PATH TO TEXT FILE WITH THE WORDS>"
        }
      },
      "analyzer": {
        "my_test": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["protect_my_words", "my_hunspell"]
        }
      }
    }
  }
}
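The keywords file referenced by keywords_path is expected to contain one word per line, relative to the Elasticsearch config directory. As a sketch (the file name here is just an illustration), a file such as protected_words.txt could contain:
beer
You could then re-run the same analyze request from the question to check the result:
POST /test/_analyze?analyzer=my_test&text=beer
With beer marked as a keyword, the hunspell filter should pass it through unstemmed rather than reducing it to bee.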
There is also the Pattern Replace Token Filter, which might prove useful if you do want to transform particular words or patterns. It can precede the keyword marker token filter in the analyzer's filter chain.
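As a rough sketch of that idea (the filter name, pattern, and replacement below are made-up examples, not something from the question), a pattern replace filter definition would go alongside the other filters in the analysis settings:
"my_pattern_replace": {
  "type": "pattern_replace",
  "pattern": "^beers$",
  "replacement": "beer"
}
It would then be listed before "protect_my_words" in the analyzer's "filter" array, so the rewritten token is what the keyword marker and hunspell filters see.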
Upvotes: 1