Paul Taylor

Reputation: 13190

Does StandardTokenizer remove punctuation (in Lucene 4.1)

I'm just trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.

However, this code fails on incrementToken(), implying that the !!! is removed from the output. Yet looking at the JFlex classes, I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I prevent its removal?

Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
assertTrue(tokenizer.incrementToken());

Upvotes: 1

Views: 2112

Answers (1)

femtoRgon

Reputation: 33341

Yes, punctuation will be removed (speaking very generally; it's more complex than just that). The string you've provided effectively has zero tokens after going through the tokenizer. StandardTokenizer implements the word-break rules of UAX #29 (Unicode Text Segmentation), so you can read that over for a complete description.

It does this to separate the input into tokens representing, roughly, words. Since you want punctuation to remain a part of your tokens in one way or another, I'm guessing that indexing words isn't really what you want to do, so StandardTokenizer is probably just not a good choice.
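
To get a feel for why "!!!" ends up with zero tokens, here is a rough sketch that uses only the JDK. It is not Lucene's implementation: it swaps in java.text.BreakIterator for the word-boundary step (its rules may differ from UAX #29 in the details) and then, as an assumption standing in for the tokenizer's behaviour, keeps only segments that contain at least one letter or digit. The class name WordBreakSketch and the method wordTokens are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: approximates "split at word boundaries, drop
// punctuation-only segments" with the JDK, not with Lucene.
final class WordBreakSketch {

    // Returns the segments between word boundaries that contain at least
    // one letter or digit; whitespace and punctuation segments are skipped.
    static List<String> wordTokens(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Assumption: a segment counts as a "word" only if it has a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokens("!!!"));             // prints []
        System.out.println(wordTokens("Hello, world!!!")); // prints [Hello, world]
    }
}
```

With input like "!!!" no segment passes the letter-or-digit check, so the list is empty, which mirrors incrementToken() returning false on the very first call.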

Upvotes: 2
