Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Question

I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing — Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
https://stackoverflow.com/a/36237769/462347

My questions:

Why Lucene doesn't filter stop words?
How can I enable the stop words filtering in Lucene 5.5 / 6.0?

Mike · Accepted Answer

The pitfall was in the default Lucene's stop words list, I expected, it is much more broader.

Here is the code which by default tries to load the customized stop words list and if it's failed then uses the standard one:

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Answers (2)

Related Questions

Apache Lucene doesn&#39;t filter stop words despite the usage of StopAnalyzer and StopFilter

Answers (2)

Related Questions

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter