wvp

Reputation: 1174

Lucene search, word separator unaware

I have a list of words, for example:

'today
today
t-oday
t oday
t/oda y

How can I retrieve all of these words from a Lucene index if I search for today, t/oday, or 'today?

I actually want the search to be insensitive to ampersand, dash, space and some other characters.

What is the best way to deal with this situation? Should I write my own analyzer/tokenizer, or is there something I can use to perform this search?

I'm using Hibernate Search.

Upvotes: 0

Views: 288

Answers (1)

femtoRgon

Reputation: 33351

Adding a CharFilter to your analyzer would probably be the best solution. This allows you to preprocess the input before the tokenizer is even applied. There are some TokenFilter examples in the Hibernate documentation (see example 4.13).

I'd recommend using a MappingCharFilterFactory and defining a mapping that strips the characters you aren't interested in.

Stripping all the spaces from the input seems a rather unusual case to me, since that will likely prevent useful tokenization, but I'll assume you have taken that into consideration.
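To make the effect of such a mapping concrete, here is a minimal plain-Java sketch of what stripping those characters before tokenization does to the sample words. It does not use Lucene or Hibernate Search; the class SeparatorNormalizer, its normalize method, and the exact set of mapped characters are hypothetical and only illustrate the idea. In a real setup the mapping would be handed to the MappingCharFilterFactory on your analyzer definition rather than written by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: imitates the effect of a character mapping that runs
// before tokenization. This is NOT Lucene or Hibernate Search API.
class SeparatorNormalizer {

    // Assumed mapping: each listed character is replaced by the empty string.
    private static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("'", "");
        MAPPING.put("-", "");
        MAPPING.put("/", "");
        MAPPING.put("&", "");
        MAPPING.put(" ", "");
    }

    // Applies every mapping entry to the input, in insertion order.
    static String normalize(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : MAPPING.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] indexed = {"'today", "today", "t-oday", "t oday", "t/oda y"};
        // The query goes through the same normalization as the indexed text.
        String query = normalize("t/oday");
        for (String term : indexed) {
            String normalized = normalize(term);
            System.out.println(term + " -> " + normalized
                    + " (matches query: " + normalized.equals(query) + ")");
        }
    }
}
```

The same mapping has to be applied to both the indexed text and the query text, otherwise the normalized forms will not line up. The sketch also shows the caveat about spaces: with this mapping, "to day" and "today" become indistinguishable.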

Upvotes: 0
