Reputation: 3
I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").
The only simple method I can think of is to ignore words below a certain length (anything shorter than the cutoff is treated as a stop word). Any other method would probably require data mining, though I'm open to suggestions.
I would prefer a method that I can apply as I go through the documents, but I'm open to other suggestions. I just need something simple.
Upvotes: 0
Views: 1496
Reputation: 77474
Short answer: don't. Don't bother identifying them at index time; instead, strip them from the query and/or weight them appropriately with TF-IDF.
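To illustrate why TF-IDF makes an explicit stop list less necessary, here is a minimal sketch of the IDF part. The function name `idf_weights`, the toy documents, and the plain `log(N / df)` formula are assumptions for illustration (real engines typically use smoothed variants). It only shows that a word occurring in every document ends up with a weight of zero.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Return {term: idf} for a list of tokenized documents.

    Uses the plain log(N / df) variant; this is an illustrative choice,
    not the formula of any particular search engine.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy corpus (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = idf_weights(docs)
```

Here "the" appears in all three documents, so its weight is `log(3/3) = 0` and it contributes nothing to a score, while rarer words such as "bird" get the highest weight.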
Quoting the Xapian manual: http://xapian.org/docs/stemming.html
It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
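The frequency-sorting approach described in the quote could look roughly like this. It is a sketch under assumptions: the function name `candidate_stopwords`, the crude regex tokenizer, and the `top_n` cutoff are all made up for illustration, and the output is only a list of candidates to review by hand, as the manual suggests.

```python
import re
from collections import Counter

def candidate_stopwords(texts, top_n=10):
    """Return the top_n most frequent words across texts.

    The tokenizer (lowercase runs of letters and apostrophes) is a
    deliberately simple assumption; swap in your indexer's tokenizer.
    """
    counts = Counter()
    for text in texts:
        # Counter.update can be called once per document, so the counts
        # can be accumulated while the documents are being indexed.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]

# Toy corpus (hypothetical data).
texts = [
    "The cat and the dog",
    "The bird and the fish",
    "A cat saw the bird",
]
candidates = candidate_stopwords(texts, top_n=3)
```

On a corpus this small only the top entry ("the") is meaningful; with a realistic corpus you would go down the list and pick off the words you want to discard.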
Upvotes: 1