Ed Ganiukov

Reputation: 91

Lucene 3.6 and custom Tokenizer/Analyzer for special chars

I am using Lucene 3.6 with StandardAnalyzer in my project for indexing and searching. This analyzer splits the search query string on all special characters (@, #, -, _).

For example, if I search for "somename@gmail.com #2nd place", the tokenizer produces the tokens [somename][gmail][com][2nd][place]. But I need tokens like these: [somename@gmail][com][#2nd][place].

So how can I exclude these special characters from the set of token-break characters?

And one more question: do I need to re-index everything with the new analyzer, or can I use the new analyzer against the old index?

Thanks!

Upvotes: 1

Views: 315

Answers (1)

mindas

Reputation: 26703

StandardAnalyzer uses StandardTokenizer to define its grammar rules (word breaks etc.). The documentation of the latter says:

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Quickly peeking into the StandardTokenizer code, I would guess that removing "&lt;EMAIL&gt;" from TOKEN_TYPES might be sufficient. Or maybe not :-)
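For illustration, here is a minimal, self-contained sketch (plain Java, not the Lucene API) of the token-break rule you are after: letters, digits, and @, #, -, _ stay inside a token, and everything else (whitespace, dots) breaks it. In Lucene 3.6 the same rule could plausibly be expressed by subclassing CharTokenizer and overriding isTokenChar, rather than maintaining a copy of the StandardTokenizer grammar; the class and method names below are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: keeps @, #, -, _ inside
// tokens and breaks on every other non-alphanumeric character.
public class KeepSpecialCharsTokenizer {

    // The analogue of CharTokenizer.isTokenChar in Lucene.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c)
                || c == '@' || c == '#' || c == '-' || c == '_';
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);          // still inside a token
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // break character ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [somename@gmail, com, #2nd, place]
        System.out.println(tokenize("somename@gmail.com #2nd place"));
    }
}
```

Note that '.' is deliberately not a token character here, which is why "somename@gmail" and "com" come out as separate tokens, matching the output you asked for.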

And yes, you will need to reindex.

Upvotes: 2
