dark_shadow

Reputation: 3573

How to implement information retrieval techniques in Naive Bayesian Spam Filter?

I have implemented a Naive Bayesian spam filter that learns from a given data set and then classifies any new input as spam or ham. Now I want to incorporate information retrieval techniques to improve the filter's effectiveness. For example, I want to handle deliberate misspellings: if "v1agra" is written instead of "viagra", or "m0rtgage" instead of "mortgage", the filter should correct the word so that it does not distort the probability calculation.

Any good tutorials or prior work on incorporating information retrieval techniques, ideally with an implementation in Java, would be of great help.

Also, what other techniques can be used to improve the effectiveness of the filter?

Thanks in advance.

Upvotes: 1

Views: 598

Answers (1)

gvd

Reputation: 1831

What you are looking for is called word stemming. It is often used to remove differences like "walking" vs. "walked" (a Porter stemmer would convert both words to "walk"). In your case, you want to set up some rules that remove much of the spam noise, for example removing all non-alphabetic characters or making all words lower-case.
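As a rough illustration of such rules, here is a minimal Java sketch of a normalization step that could run on each token before it reaches the classifier. The class name `TokenNormalizer`, its method names, and the substitution table (`0` to `o`, `1` to `i`, and so on) are all assumptions made up for this example, not part of any library; the table in particular would need tuning against real spam.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: normalizes tokens before they are counted by the filter.
public class TokenNormalizer {

    // Assumed substitution table: characters spammers often use in place of letters.
    private static char undoSubstitution(char c) {
        switch (c) {
            case '0': return 'o';
            case '1': return 'i';
            case '3': return 'e';
            case '4': return 'a';
            case '5': return 's';
            case '@': return 'a';
            case '$': return 's';
            default:  return c;
        }
    }

    // Lower-cases the token, undoes the substitutions above,
    // and drops every character that is not a-z afterwards.
    public static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = undoSubstitution(lower.charAt(i));
            if (c >= 'a' && c <= 'z') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Splits text on whitespace and normalizes each piece,
    // skipping pieces that become empty.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String t = normalize(raw);
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

With this in place, `TokenNormalizer.normalize("v1agra")` and `TokenNormalizer.normalize("V.I.A.G.R.A")` both map to the same token as the plain spelling, so they share one probability estimate. The same rules should be applied during training and prediction. Note that this sketch also mangles genuine numbers (a year or a price turns into letters or disappears), so you may want to skip tokens that consist only of digits.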

Upvotes: 1
