Reputation: 13190
Does StandardTokenizer remove punctuation (in Lucene 4.1)?
I'm trying to move back to StandardTokenizer from my own old custom implementation, because the newer version seems to have much better support for Asian languages.
However, this code excerpt fails on incrementToken(), implying that the !!! are removed from the output. Looking at the JFlex classes, I can't see anything to indicate that punctuation is removed. Is it removed, and if so, can I prevent its removal?
Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new StringReader("!!!"));
assertNotNull(tokenizer);
tokenizer.reset();
assertTrue(tokenizer.incrementToken());
Upvotes: 1
Views: 2112
Reputation: 33341
Yes, punctuation will be removed (speaking very generally; it's more complex than just that). The string you've provided effectively has zero tokens after going through the tokenizer. StandardTokenizer implements UAX #29, so you can read that over for a complete description.
It does this to separate the input into tokens representing, roughly, words. Since you want punctuation to remain a part of your tokens in one way or another, I'm guessing that indexing words isn't really what you want to do, so StandardTokenizer is probably just not a good choice.
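To illustrate the idea without depending on Lucene, here is a minimal sketch that uses the JDK's java.text.BreakIterator as a stand-in. It is not StandardTokenizer's actual implementation, and BreakIterator's word rules are not guaranteed to match UAX #29 or Lucene's grammar exactly. The class name WordSegmentsSketch and the "keep only segments containing a letter or digit" filter are my own assumptions, chosen to mimic roughly how word-like segments are kept and punctuation-only segments are dropped.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: approximates "split at word boundaries,
// then keep only word-like segments" using the JDK, not Lucene.
class WordSegmentsSketch {

    static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumed filter: a segment counts as a token only if it
            // contains at least one letter or digit, so runs of
            // punctuation or whitespace are skipped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("!!!"));             // prints []
        System.out.println(tokens("Hello, world!!!")); // prints [Hello, world]
    }
}
```

With this kind of filtering, "!!!" yields no tokens at all, which is consistent with incrementToken() returning false on the first call in the question's test.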
Upvotes: 2