user2255473

Reputation: 3

Simple method to identify stop words

I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").

The only simple method I can think of is to ignore words below a certain length (anything shorter is treated as a stop word). Any other method would probably require data mining, though I'm open to suggestions.

I would prefer a method that I can apply as I go through the documents, but I'm open to other suggestions. I just need something simple.

Upvotes: 0

Views: 1496

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77474

The short answer is: don't. Don't bother identifying stop words during indexing; instead, strip them from the query and/or weight them appropriately with TF-IDF.
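To illustrate why TF-IDF weighting makes an explicit stop word filter less necessary, here is a minimal sketch of the IDF part in Python. The function name `inverse_document_frequency`, the whitespace tokenizer, and the toy documents are assumptions made for the example, not part of any particular search library.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Use a set so each term is counted at most once per document.
        doc_freq.update(set(text.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf = inverse_document_frequency(docs)
```

A word that occurs in every document, such as "the" here, gets an IDF of zero, so it contributes nothing to a TF-IDF score even though it stays in the index.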

Quoting the Xapian manual: http://xapian.org/docs/stemming.html

It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.

Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
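The frequency-based approach described above can be sketched in a few lines of Python. This is only a rough illustration: the function name `stopword_candidates`, the regex tokenizer, and the `top_n` cutoff are assumptions, and the result is a list of candidates to review by hand, not a finished stop word list.

```python
import re
from collections import Counter


def stopword_candidates(documents, top_n=20):
    """Return the top_n most frequent words across the corpus, most frequent first."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]


# Toy corpus, purely illustrative.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
candidates = stopword_candidates(corpus, top_n=3)
```

Because the counter is updated one document at a time, the same counting can run while the documents are being indexed. On a tiny corpus, content words such as "cat" also rank highly, which is why the list still needs a manual pass.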

Upvotes: 1
