Reputation: 19646
In Lucene, I want to index products. As I understand it, words such as "t-shirt" get tokenized into "t" and "shirt".
I want searches for "shirt" not to match "t-shirt", i.e. "t-shirt" should be treated as a single token.
What's the simplest way to achieve this?
Cheers.
Upvotes: 1
Views: 564
Reputation: 3035
You could alter the rules of the StandardTokenizer and create a custom one.
To do this, regenerate the StandardTokenizerImpl class with JFlex after altering the original rules. (You would need to translate the output to C#.)
Then take the code for the StandardTokenizer and alter it to use the newly generated TokenizerImpl from JFlex.
If you don't need the existing rules in the StandardTokenizer, you could also try the WhitespaceTokenizer, which splits only on whitespace and so leaves "t-shirt" as one token.
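To make the difference concrete, here is a rough plain-Java sketch with no Lucene dependency. It is not Lucene's API: the class TokenizeSketch and both methods are made up for illustration, and punctuationTokens only approximates the hyphen-splitting behaviour described in the question, not the real StandardTokenizer grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or Lucene.Net.
class TokenizeSketch {

    // Split on runs of whitespace only, so hyphenated words stay whole.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Split on anything that is not a letter or digit. This is only a
    // stand-in for a tokenizer that breaks "t-shirt" at the hyphen.
    static List<String> punctuationTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "blue t-shirt large";
        System.out.println(whitespaceTokens(title));
        System.out.println(punctuationTokens(title));
    }
}
```

With whitespace-only splitting, a query for "shirt" has no matching token in this title, while a query for "t-shirt" does. Note that other punctuation would also stay attached to tokens (for example a trailing comma), so you may still need some cleanup in a filter.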
Upvotes: 1