Dave Bish

Reputation: 19646

Lucene.Net include hyphen in tokenizer

I'm indexing products in Lucene.Net and, as I understand it, words such as "t-shirt" get tokenized into "t" and "shirt".

I don't want searches for "shirt" to match "t-shirt"; that is, "t-shirt" should be treated as a single token.

What's the simplest way to achieve this?

Cheers.

Upvotes: 1

Views: 564

Answers (1)

Dreamwalker

Reputation: 3035

You could create a custom tokenizer by modifying the rules of the StandardTokenizer.

To do this, alter the original grammar rules and regenerate the StandardTokenizerImpl class with JFlex. (JFlex emits Java, so you would need to translate the output to C#.)

Then take the code for the StandardTokenizer and alter it to use the newly generated TokenizerImpl from JFlex.

If you don't need the existing rules in the StandardTokenizer, you could also try the WhitespaceTokenizer, which splits on whitespace only.
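To see what you would gain and lose with the whitespace approach, here is a minimal sketch in plain Java with no Lucene dependency. It only models the two splitting behaviours; the class and method names (`TokenizerSketch`, `whitespaceTokens`, `punctuationSplitTokens`) are made up for illustration and are not Lucene or Lucene.Net APIs, and the punctuation-splitting rule is a rough stand-in for hyphen splitting, not the StandardTokenizer grammar.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizerSketch {

    // Hypothetical helper: split on runs of whitespace only,
    // so a hyphenated word such as "t-shirt" stays one token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical helper: split on anything that is not a letter or digit,
    // so the hyphen breaks "t-shirt" into "t" and "shirt".
    static List<String> punctuationSplitTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Blue t-shirt, large";
        System.out.println(whitespaceTokens(text));
        System.out.println(punctuationSplitTokens(text));
    }
}
```

For the input `Blue t-shirt, large`, the whitespace split yields the tokens `Blue`, `t-shirt,` and `large`, while the punctuation split yields `Blue`, `t`, `shirt` and `large`. The hyphenated word survives whitespace splitting, but so does the trailing comma, and nothing is lowercased. That is the cost of dropping the existing rules, so with a whitespace-based tokenizer you would probably want to add your own filtering for case and stray punctuation.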

Upvotes: 1
