Lucene TokenFilter with EnglishAnalyzer for removing numbers in scientific articles

Question

I am indexing scientific articles with Lucene. I am using the following configuration:

EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_43, EnglishAnalyzer.getDefaultStopSet());

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);

That works well for words. However, I would like to remove tokens like "0.99" or "3,14" but preserve text like "H2O" (and, if possible, also "n=3") as a single token. I have tried the SimpleAnalyzer, but it is not what I want.

Any ideas?

Thanks!

Mysterion · Accepted Answer

You could achieve what you want with a custom but simple FilteringTokenFilter that filters out all unwanted tokens, for example by regexp. All you need to do is extend this class and implement the accept method:

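// Term text of the current token; declared as a field of the custom filter class.
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

@Override
protected boolean accept() throws IOException {
    String token = termAtt.toString();
    // Drop tokens made up only of digits, commas, and dots (e.g. "0.99", "3,14").
    return !token.matches("[0-9,.]+");
}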

In this case, I'm filtering out all tokens that contain only digits, commas, and dots (as possible delimiters).
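To see which tokens the pattern rejects without setting up Lucene, the check can be tried on plain strings. This is only a minimal sketch: the class name NumericTokenCheck and the helper isNumericToken are made up for illustration and are not part of Lucene; only the regex comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class NumericTokenCheck {
    // Same character class as the regex in accept(), compiled once.
    private static final Pattern NUMERIC = Pattern.compile("[0-9,.]+");

    // True when the whole token consists only of digits, commas, and dots.
    static boolean isNumericToken(String token) {
        return NUMERIC.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0.99", "3,14", "2013", "H2O", "n=3"};
        for (String t : samples) {
            System.out.println(t + " -> " + (isNumericToken(t) ? "dropped" : "kept"));
        }
    }
}
```

Because the match has to cover the whole token, "H2O" and "n=3" are kept even though they contain digits.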

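// Split on whitespace only, so constructs like "n=3" stay in one token.
Tokenizer whitespaceTokenizer = new WhitespaceTokenizer(Version.LUCENE_43, reader);
// Remove English stop words, then drop the purely numeric tokens.
TokenStream tokenStream = new StopFilter(Version.LUCENE_43, whitespaceTokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream = new ScientificFiltering(tokenStream);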

To keep "n=3" and similar constructions intact as single tokens, I would recommend using WhitespaceTokenizer, which splits tokens only on whitespace characters.
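The effect of whitespace-only splitting combined with the numeric filter can be approximated in plain Java. This is a rough sketch, not the Lucene pipeline itself: String.split stands in for WhitespaceTokenizer, stop-word removal is left out, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer + filter chain; no Lucene involved.
class WhitespaceFilterDemo {
    // Splits on runs of whitespace and keeps every token that is not purely numeric.
    static List<String> keepNonNumeric(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !token.matches("[0-9,.]+")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepNonNumeric("pH 0.99 for H2O with n=3 and 3,14"));
    }
}
```

One consequence of splitting on whitespace only is that punctuation stays attached to the token, so a token such as "(0.99)" does not match the pattern and would be kept.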

For a full example take a look here
