minh nguyen

Reputation: 65

How to tokenize a sentence based on maximum number of words in Elasticsearch?

I have a string like "This is a beautiful day". What tokenizer, or what combination of tokenizer and token filter, should I use to produce output terms of at most 2 words each? Ideally, the output would be: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day". So far I have tried all the built-in tokenizers; the 'pattern' tokenizer seems like the one I could use, but I don't know how to write a regex pattern for my case. Any help?

Upvotes: 1

Views: 172

Answers (2)

Abd Rmdn

Reputation: 550

As @Oleksii said. In your case max_shingle_size = 2 (which is the default) is what you need; note that min_shingle_size cannot go below 2, so the single-word terms come from output_unigrams, which is true by default.
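A minimal index-settings sketch along these lines should produce the terms in the question (the filter and analyzer names here are just placeholders):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "bigram_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["bigram_shingles"]
        }
      }
    }
  }
}
```

You can check the output with the _analyze API, e.g. `GET my-index/_analyze` with `"analyzer": "shingle_analyzer"` and `"text": "This is a beautiful day"`, which should return the unigrams plus the two-word shingles.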

Upvotes: 0

Oleksii

Reputation: 154

It seems you're looking for the shingle token filter; it does exactly what you want.

Upvotes: 1
