michael

Reputation: 1767

How should I index and search on hyphenated words in English?

I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.

In my data, words frequently appear either hyphenated or as compound words, e.g. pre-eclampsia and preeclampsia. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as pre eclampsia).

This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for jean claude and happy go lucky should match jean-claude and happy-go-lucky.
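To illustrate the hyphen-stripping idea, something like the following could work for the simple case (a sketch only; the index, analyzer, and filter names are made up, and this deliberately ignores the cases where I *do* want to split on hyphens):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_hyphens": {
          "type": "pattern_replace",
          "pattern": "-",
          "replacement": ""
        }
      },
      "analyzer": {
        "no_hyphens": {
          "char_filter": ["strip_hyphens"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

With this analyzer, pre-eclampsia and preeclampsia both index as preeclampsia, but jean-claude collapses to jeanclaude and no longer matches jean claude.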

One approach to solving this is in the application layer, by essentially transforming any query for hyphenated-word into hyphenated-word OR hyphenatedword. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)

Upvotes: 2

Views: 669

Answers (1)

MatsLindh

Reputation: 52862

You can use a compound word token filter - hyphenation_decompounder should probably work well enough.
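A sketch of what that could look like (the filter and analyzer names here are placeholders, and the hyphenation patterns file has to be a FOP-compatible XML file that you place under the node's config directory yourself):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/en_hyphenation_patterns.xml",
          "word_list": ["eclampsia", "claude", "lucky"],
          "min_subword_size": 4
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}
```

The decompounder emits the matched subwords as extra tokens alongside the original, so preeclampsia also indexes an eclampsia token and a search for the compound or the parts can match.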

It seems like your index consists of many domain-specific words that aren't necessarily in a regular English dictionary, so I'd spend some time creating my own dictionary first with the words that are important to your domain. This can be based on domain-specific literature, taxonomies, etc. The dictionary_decompounder is suitable for doing stuff like that.
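As a rough sketch, with a small inline word list (in practice you'd point word_list_path at your domain dictionary file; all names here are placeholders):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "domain_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["eclampsia", "jean", "claude"]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "domain_decompounder"]
        }
      }
    }
  }
}
```

Any token containing one of the dictionary words as a substring gets that word emitted as an additional token, which is why the dictionary quality matters a lot - an overly broad list will produce spurious subword matches.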

This assumes your question is about Elasticsearch and not Solr, where the corresponding filter is named DictionaryCompoundWordTokenFilter instead.

Upvotes: 1
