Reputation: 485

Removing stop words while indexing files using Apache Lucene

I am working on a project that involves indexing files with Apache Lucene. The indexing itself works, but when I look at the results I see many irrelevant words, probably because I am not removing stop words while indexing.

I read on the web that Lucene provides a way to remove stop words while indexing files. How can I do that?

Upvotes: 0

Answers (2)

sachin

Reputation: 21

If you use StandardAnalyzer or StopAnalyzer, stop words such as "on", "a", "an" and "the" are removed automatically during indexing, and you cannot search for them. If you also want to search for stop words such as "was", "is" and "on", you have to use WhitespaceAnalyzer or SimpleAnalyzer instead.

Upvotes: 0

femtoRgon

Reputation: 33351

Lucene's StandardAnalyzer includes a StopFilter that removes some typical stop words from anything passed through it. The standard list of English stop words is pretty short: mainly some articles, pronouns and prepositions.

If you wish to define your own set of stop words, StandardAnalyzer has a couple of constructors that let you pass one in, in particular the one that takes a CharArraySet. Simply create a CharArraySet containing the desired stop words, pass it into that constructor, and you're on your way.

I believe most other typical analyzers have a constructor accepting the same arguments as well (at a glance, it looks like almost all of the language analyzers in analyzers-common follow that pattern).
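To make the custom stop word set concrete, here is a minimal sketch rather than a definitive implementation. It assumes Lucene 7 or later on the classpath, where CharArraySet lives in org.apache.lucene.analysis (older releases keep it in org.apache.lucene.analysis.util). The class name, the tokensWithout helper, the field name "body" and the sample stop words are all made up for illustration. As far as I know, recent releases (8.0 onward) ship StandardAnalyzer with an empty default stop set, so passing the set explicitly is the safer route either way.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomStopWordsDemo {

    // Hypothetical helper: analyzes the text with a StandardAnalyzer built on
    // the given stop words and returns the tokens that survive.
    static List<String> tokensWithout(List<String> stopWords, String text) throws IOException {
        // "true" makes stop word matching case-insensitive.
        CharArraySet stopSet = new CharArraySet(stopWords, true);
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopSet);
             // "body" is an arbitrary field name chosen for this sketch.
             TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // required before the first incrementToken()
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        List<String> stopWords = Arrays.asList("was", "is", "on", "the");
        System.out.println(tokensWithout(stopWords, "The report was filed on Monday"));
    }
}
```

Running the text through a TokenStream like this is a handy way to check what an analyzer actually keeps before you index anything with it. The listed stop words are dropped and the remaining tokens come out lowercased.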

Of course, be sure to use the same analyzer for both indexing and searching.
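Here is a rough sketch of what sharing one analyzer between indexing and searching can look like. It assumes Lucene 8 or later (for ByteBuffersDirectory) and builds the query with QueryBuilder from lucene-core, although a QueryParser constructed with the same analyzer would serve the same purpose. The class name, the countHits helper, the field name "body" and the stop words are invented for the example.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.QueryBuilder;

public class SameAnalyzerDemo {

    // Hypothetical helper: indexes one document in memory, then counts how
    // many documents match the query text, using ONE analyzer for both steps.
    static int countHits(String docText, String queryText) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(
                new CharArraySet(Arrays.asList("the", "on", "a"), true));

        try (Directory dir = new ByteBuffersDirectory()) {
            // Index time: the analyzer strips the stop words from the document text.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", docText, Field.Store.NO));
                writer.addDocument(doc);
            }

            // Search time: the same analyzer strips the same words from the query.
            Query query = new QueryBuilder(analyzer).createBooleanQuery("body", queryText);
            if (query == null) {
                return 0; // the query consisted only of stop words
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return new IndexSearcher(reader).count(query);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countHits("The cat sat on a mat", "the cat"));
    }
}
```

Because the stop words never reach the index, a query made up only of stop words has nothing left to match, which is why the sketch returns 0 when QueryBuilder hands back null.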

Upvotes: 1
