Reputation: 162
I am indexing scientific articles with Lucene. I am using the following configuration:
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_43, EnglishAnalyzer.getDefaultStopSet());
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
That works well for words. However, I would like to remove purely numeric tokens such as "0.99" or "3,14" while keeping text such as "H2O" (and, if possible, "n=3") as a single token. I have tried SimpleAnalyzer, but it does not do what I want.
Any ideas?
Thanks!
Upvotes: 1
Views: 722
Reputation: 9320
You can achieve this with a small custom FilteringTokenFilter that filters out all unwanted tokens, for example by regular expression. All you need to do is extend this class and implement its accept method:
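// termAtt is assumed to be the filter's CharTermAttribute field
@Override
protected boolean accept() throws IOException {
    // Text of the current token
    String token = termAtt.toString();
    // Reject tokens made up only of digits, commas and dots, e.g. "0.99" or "3,14"
    return !token.matches("[0-9,.]+");
}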
In this case, I am filtering out every token that consists only of digits, commas, and dots (the possible numeric separators).
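To see what this pattern does and does not reject, here is a minimal plain-Java sketch that needs no Lucene on the classpath. The class name NumericTokenCheck and the method isNumericToken are hypothetical helpers made up for illustration; they only mirror the check performed inside accept().

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: mirrors the check done inside accept().
final class NumericTokenCheck {
    // Compiled once; String.matches() would recompile the pattern for every token.
    private static final Pattern NUMERIC = Pattern.compile("[0-9,.]+");

    // True when the whole token consists only of digits, commas and dots.
    static boolean isNumericToken(CharSequence token) {
        return NUMERIC.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0.99", "3,14", "H2O", "n=3", "-0.5", "."};
        for (String s : samples) {
            System.out.println(s + " -> " + (isNumericToken(s) ? "dropped" : "kept"));
        }
    }
}
```

Note that with this pattern a signed value such as "-0.5" is kept, because the minus sign is not in the character class, while a lone "." is dropped. Whether that is acceptable depends on your articles, so adjust the character class if needed.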
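// Lucene 4.3 constructors take a Version argument; ScientificFiltering is the custom filter described above
Tokenizer whitespaceTokenizer = new WhitespaceTokenizer(Version.LUCENE_43, reader);
TokenStream tokenStream = new StopFilter(Version.LUCENE_43, whitespaceTokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream = new ScientificFiltering(tokenStream);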
To keep n=3 and similar constructions intact, I recommend using WhitespaceTokenizer, which splits tokens only on whitespace characters.
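The effect of the tokenizer choice can be illustrated without Lucene. The sketch below only simulates the two behaviours with plain string splitting. TokenizationSketch, its methods, and the tiny stop list are all hypothetical and are not the Lucene implementation; letterTokens approximates what a letter-based, lowercasing tokenizer such as the one behind SimpleAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the two tokenization strategies; not Lucene code.
final class TokenizationSketch {
    // Tiny illustrative stop list; Lucene's English stop set is larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "with", "was", "and"));

    // Split on whitespace only, then drop stop words and purely numeric tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;                 // empty input
            if (STOP_WORDS.contains(token)) continue;      // stop-word step
            if (token.matches("[0-9,.]+")) continue;       // numeric-token step
            kept.add(token);
        }
        return kept;
    }

    // Split on every non-letter character and lowercase the pieces.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) tokens.add(token.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("the mass of H2O with n=3 was 0.99 and 3,14"));
        System.out.println(letterTokens("H2O n=3"));
    }
}
```

With whitespace splitting, "H2O" and "n=3" survive as single tokens and the numbers are removed by the filter. With letter-based splitting, the digits act as separators, so "H2O" falls apart before any filter can see it.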
For a full example take a look here
Upvotes: 3