minh nguyen

Reputation: 65

How to tokenize a sentence based on maximum number of words in Elasticsearch?

I have a string like "This is a beautiful day". What tokenizer, or what combination of tokenizer and token filter, should I use to produce output terms of at most 2 words each? Ideally, the output would be: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day". So far I have tried all the built-in tokenizers; the 'pattern' tokenizer seems like the one I could use, but I don't know how to write a regex pattern for my case. Any help?

Upvotes: 1

Views: 172

Answers (2)

Abd Rmdn

Reputation: 550

As @Oleksii said. In your case max_shingle_size = 2 (which is the default) is what you need; note that min_shingle_size cannot go below 2, so the single-word terms come from output_unigrams, which is true by default.
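A minimal index-settings sketch along these lines should produce the terms in the question (the filter and analyzer names here are just placeholders):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "bigram_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["bigram_shingles"]
        }
      }
    }
  }
}
```

You can check the output with the _analyze API, e.g. `GET my-index/_analyze` with `"analyzer": "shingle_analyzer"` and `"text": "This is a beautiful day"`, which should return the unigrams plus the two-word shingles.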

Upvotes: 0

Oleksii

Reputation: 154

It seems you're looking for the shingle token filter; it does exactly what you want.

Upvotes: 1
