
Reputation: 9509

Algorithm for separating nonsense text from meaningful text

I added a feedback function to some of my programs. Unfortunately, I forgot to include any kind of spam protection, so users can send anything they want to my server, where every message is stored in a huge database.

At first I checked the feedback periodically, keeping what was usable and deleting the garbage. The problem is that I get 900 messages per day. Only 4-5 are really useful; the rest are mostly one of two types of gibberish:

nonsense: jfvgasdjkfahs kdlfjhasdf (people smashing their heads on the keyboard)
text in a language I don't understand

What I did so far:

I installed a filter to delete any feedback containing "asdf", "qwer", etc. -> only 700 per day
I installed a word filter to delete anything containing bad language -> 600 per day (don't ask, but there are many strange people out there)
I filter out any message containing letters that are not used in my language -> 400 per day

But 400 per day is still far too many. So I'm wondering whether anybody has dealt with this kind of problem before and knows an algorithm for filtering out senseless messages.

Any help would really be appreciated!

Upvotes: 9

Answers (11)

Jarin Udom

Reputation: 1849

Just store comments in a pending state, pass them through Akismet or Defensio, and use the response to mark them as potential spam or mark them active.

http://akismet.com/

http://defensio.com/

I personally prefer Defensio's API but they both work fantastically well.

Upvotes: 0

Frank

Reputation: 66154

Yes, as others have pointed out, you could look at spam filters or Markov models.

Something simpler would be to just count the different words in each response and sort by frequency. If words like the following are not at the top then it's probably not valid text:

the, a, in, of, and, or, ...

These are the most frequently used words in ordinary English text.
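A minimal sketch of this check in Python. The stop-word list `STOP_WORDS`, the helper name `looks_like_english`, and the `top_n` cutoff are all assumptions made for illustration; none of them come from the answer itself.

```python
import re
from collections import Counter

# Assumed, hand-picked list of very common English words; extend as needed.
STOP_WORDS = {"the", "a", "in", "of", "and", "or", "to", "is", "it", "i"}

def looks_like_english(text, top_n=5):
    """Return True if a common English word is among the top_n most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    # most_common sorts by count, highest first
    most_common = [word for word, _ in Counter(words).most_common(top_n)]
    return any(word in STOP_WORDS for word in most_common)

print(looks_like_english("The app crashes when I open the settings page and the log is empty"))
print(looks_like_english("jfvgasdjkfahs kdlfjhasdf"))
```

Feedback messages tend to be short, so most words occur only once and the ranking carries little information. Measuring the share of stop words among all words may work better in that case.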

Upvotes: 0

Norman Ramsey

Reputation: 202475

Fidelis Assis and I have been reworking the spam filter OSBF-Lua so that it can easily be used in other applications, including web applications. This spam filter won the TREC spam contest three years running. (I don't mind bragging because the algorithm is Fidelis's, not mine.)

If you want to try things out, we have "nearly beta" code at

git clone http://www.cs.tufts.edu/~nr/osbf-lua-temp

We are still a long way from having a tidy release, but the code should build provided you install automake 1.9. Either of us would be happy to advise you on how to use it to clean your database and to integrate it into your application.

Upvotes: 2

Liudvikas Bukys

Reputation: 5870

The preceding answers about setting up a Bayesian-inspired spam-filter classifier are a good idea. For your application, since you seem to get a lot of long nonsense words, it would be best to turn on an option in your parser to train on bigrams and trigrams; otherwise, many of the nonsense words will just be treated as "never seen before", which is not the most useful parse in your case.
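One way to read this is as character bigrams and trigrams, which is an assumption on my part; the answer does not say whether it means characters or words. Under that reading, a rough sketch of the feature extraction (the name `char_ngrams` is hypothetical) could look like this:

```python
def char_ngrams(text, sizes=(2, 3)):
    """Return the overlapping character n-grams of every word in text."""
    features = []
    for word in text.lower().split():
        for n in sizes:
            # Words shorter than n produce no n-grams of that size
            for i in range(len(word) - n + 1):
                features.append(word[i:i + n])
    return features

print(char_ngrams("the"))  # ['th', 'he', 'the']
```

A nonsense word such as "kdlfjhasdf" then becomes a bag of fragments like "kd", "dl", and "lfj" that the classifier can learn to associate with junk, instead of a single token it has never seen.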

Upvotes: 0

joel.neely

Reputation: 30933

Look up Claude Shannon and Markov models. These lead to a statistical technique for assessing probabilities that letter combinations come from a specified language source.

Here are some relevant course notes from Princeton University.
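As a rough illustration of the idea, here is a first-order character Markov model in Python. It is only a sketch: `SAMPLE_CORPUS` is a tiny stand-in for real training text, the function names are made up, and unseen transitions get an arbitrary floor probability instead of proper smoothing.

```python
import math
from collections import Counter

# Tiny stand-in corpus (an assumption); a real model needs far more text.
SAMPLE_CORPUS = (
    "the quick brown fox jumps over the lazy dog. "
    "this is a useful comment about the program. "
    "please fix the bug in the settings page and thank you for the great work."
)

def normalize(text):
    """Lowercase, turn everything except a-z into spaces, collapse whitespace."""
    cleaned = "".join(c if "a" <= c <= "z" else " " for c in text.lower())
    return " ".join(cleaned.split())

def train(corpus):
    """Count character pairs and how often each character starts a pair."""
    text = normalize(corpus)
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts

def avg_log_prob(text, model, floor=1e-4):
    """Average log-probability per character transition; higher is more language-like."""
    pair_counts, first_counts = model
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        seen_from_a = first_counts.get(a, 0)
        p = pair_counts.get((a, b), 0) / seen_from_a if seen_from_a else 0.0
        # Unseen transitions fall back to the floor probability
        total += math.log(max(p, floor))
    return total / len(pairs)

model = train(SAMPLE_CORPUS)
print(avg_log_prob("this is a useful comment", model))
print(avg_log_prob("jfvgasdjkfahs kdlfjhasdf", model))
```

Messages whose score falls below some cutoff would be treated as nonsense. The cutoff has to be chosen by looking at scores for known good and known bad feedback.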

Upvotes: 2

oglester

Reputation: 6670

You might try the Bayesian algorithm used by many spam filters.

Better Bayesian Filtering

Wikipedia explanation

Some open Source
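To give a feel for what such a filter does, here is a small naive Bayes classifier with add-one smoothing. It is a simplified sketch, not the algorithm from any of the linked pages; the class name, the "good"/"junk" labels, and the training messages are all invented for the example.

```python
import math
from collections import Counter

class NaiveBayes:
    """Word-based naive Bayes with two classes: "good" and "junk"."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "junk": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts["good"]) | set(self.word_counts["junk"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Smoothed log prior for the class
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing; the extra +1 reserves mass for unknown words
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train("the export button crashes the app", "good")
nb.train("please add a dark theme to the app", "good")
nb.train("asdkjh qwlekj zxcmnb", "junk")
nb.train("lol lol lol u suck", "junk")
print(nb.classify("the app crashes on export"))
```

In practice the classifier would be trained on the feedback that has already been sorted by hand into useful and useless messages.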

Upvotes: 6

Marius

Reputation: 58911

The simplest method would be to count the occurrences of each letter. E is the most common letter in English, so it should appear the most. You could also check word and digraph frequencies. Have a look here for lists of the most frequently used letters, words, and digraphs in English.
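A small sketch of the letter-counting idea. Instead of looking only at "e", it measures what share of the letters belongs to a set of commonly cited frequent English letters. The set `COMMON_LETTERS`, the function name, and the 0.6 cutoff are assumptions for illustration and would need tuning on real feedback.

```python
from collections import Counter

# Commonly cited most frequent English letters (assumed reference set).
COMMON_LETTERS = set("etaoinshr")

def common_letter_share(text):
    """Return the fraction of letters in text that are in COMMON_LETTERS."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    common = sum(n for letter, n in counts.items() if letter in COMMON_LETTERS)
    return common / len(letters)

for message in ("The app crashes when I open the settings page",
                "jfvgasdjkfahs kdlfjhasdf"):
    share = common_letter_share(message)
    print(f"{share:.2f}", "keep" if share > 0.6 else "discard")
```

Keyboard mashing tends to stay on the home row, so letters such as "j", "k", "d", and "f" are heavily over-represented and pull the share down.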

Upvotes: 3

Tomas Aschan

Reputation: 60564

I had a spam problem in the guestbook on one of my sites quite a long while ago. My solution was simply to add a little captcha-like Q&A field asking the user "Are you a spamming robot?" Any answer containing the word "no" (letting through "no, i'm not", "nope" and "not at all" too, just for fun...) permitted the user to post...

The reason I chose not to use captcha was simply that my users wanted a more "cozy" feel to the site, and a captcha felt too formal. This was more personal =)
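The check described above amounts to a substring test. A minimal sketch, with a made-up function name:

```python
def passes_robot_question(answer):
    """Accept any reply to "Are you a spamming robot?" that contains "no"."""
    return "no" in answer.lower()

print(passes_robot_question("Nope"))        # True
print(passes_robot_question("not at all"))  # True
print(passes_robot_question("yes"))         # False
```

Because it is a plain substring match, replies such as "I don't know" also pass, which fits the relaxed spirit of the original.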

Upvotes: 5

Rob Walker

Reputation: 47452

A slightly different approach would be to set up a system that emails the feedback messages to an account and to use standard spam filtering on it. You could send them through Gmail and let its filtering take a shot at them. Not perfect, but not much effort to implement either.

Upvotes: 12

maxaposteriori

Reputation: 7327

If you're only expecting (or only care about) English comments, then why not simply count the number of valid words (with respect to some dictionary) in the submitted feedback? If the number passes some threshold, accept the feedback. If not, trash it. This simple heuristic could be extended to other languages by adding their dictionaries.
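A minimal sketch of this heuristic. It uses the share of valid words rather than the raw count so that short and long messages are treated alike; that choice, the 0.5 threshold, the function names, and the tiny `DICTIONARY` set are all assumptions. A real setup would load a full word list.

```python
import re

# Tiny stand-in dictionary (an assumption); load a real word list in practice.
DICTIONARY = {"the", "save", "button", "is", "broken", "please", "fix", "it"}

def valid_word_ratio(text, dictionary):
    """Return the fraction of words in text that appear in dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)

def accept(text, dictionary, threshold=0.5):
    """Accept feedback when enough of its words are known."""
    return valid_word_ratio(text, dictionary) >= threshold

print(accept("The save button is broken!", DICTIONARY))
print(accept("jfvgasdjkfahs kdlfjhasdf", DICTIONARY))
```

Supporting another language would mean passing a different or merged dictionary set to `accept`.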

Upvotes: 6

John Nilsson

Reputation: 17307

How about just using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.

Upvotes: 12
