Reputation: 65
I have a string like "This is a beautiful day". Which tokenizer, or which combination of tokenizer and token filter, should I use to produce terms of at most two words? Ideally, the output should be: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day". So far I have tried all the built-in tokenizers. The 'pattern' tokenizer seems like the one I could use, but I don't know how to write a regex pattern for my case. Any help?
Upvotes: 1
Views: 172
Reputation: 550
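As @Oleksii said, use the shingle token filter. In your case the defaults already work: max_shingle_size = 2 and min_shingle_size = 2 (the filter does not accept a minimum below 2). The single-word terms come from output_unigrams, which is true by default.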
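To make that concrete, here is a rough sketch of index settings that wire a shingle filter into a custom analyzer, written as a Python dict that serializes to the JSON request body. The names `my_shingle` and `two_word_terms` are made up for this example, and the choice of the `standard` tokenizer is an assumption. Note that the filter requires min_shingle_size to be at least 2, so the one-word terms come from output_unigrams rather than from a minimum of 1.

```python
import json

# "my_shingle" and "two_word_terms" are hypothetical placeholder names.
# The values shown are the filter's defaults, spelled out for clarity.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,    # smallest allowed value
                    "max_shingle_size": 2,    # at most two words per term
                    "output_unigrams": True,  # also emit single words
                }
            },
            "analyzer": {
                "two_word_terms": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumed; any word tokenizer works
                    "filter": ["my_shingle"],
                }
            },
        }
    }
}

# JSON body to send when creating the index.
body = json.dumps(settings, indent=2)
print(body)
```

You would probably add a lowercase filter before the shingle filter if case should not matter at search time.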
Upvotes: 0
Reputation: 154
It seems you're looking for the shingle token filter; it does exactly what you want.
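If it helps to see what shingling means, here is a small pure-Python imitation of the idea. It is only an illustration, not Elasticsearch's implementation: the function name `shingles` is hypothetical and it splits on whitespace instead of using a real tokenizer.

```python
def shingles(text, max_size=2):
    """Return every run of 1..max_size consecutive words, in position order."""
    words = text.split()  # naive whitespace split, stands in for a tokenizer
    terms = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            # Skip runs that would go past the end of the text.
            if start + size <= len(words):
                terms.append(" ".join(words[start:start + size]))
    return terms


print(shingles("This is a beautiful day"))
# ['This', 'This is', 'is', 'is a', 'a', 'a beautiful',
#  'beautiful', 'beautiful day', 'day']
```

That matches the output asked for in the question, which is why no regex for the 'pattern' tokenizer is needed.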
Upvotes: 1