Reputation: 9509
I provided some of my programs with a feedback function. Unfortunately I forgot to include some sort of spam-protection - so users could send anything they wanted to my server - where every feedback is stored in a huge db.
In the beginning I periodically checked the feedback: I filtered out what was usable and deleted the garbage. The problem is: I get 900 feedback messages per day. Only 4-5 are really useful; the other messages are mostly two types of gibberish:
What I did so far:
I installed a filter to delete any feedback containing "asdf", "qwer" etc... -> only 700 per day
I installed a word filter to delete anything containing bad language -> 600 per day (don't ask - but there are many strange people out there)
But 400 per day is still way too much. So I'm wondering if anybody has dealt with such a problem before and knows some sort of algorithm to filter out senseless messages.
Any help would really be appreciated!
Upvotes: 9
Views: 4198
Reputation: 1849
Just store comments in a pending state, pass them through Akismet or Defensio, and use the response to mark them as potential spam or mark them active.
I personally prefer Defensio's API but they both work fantastically well.
Upvotes: 0
Reputation: 66154
Yes, like people pointed out, you could look at spam filters or Markov Models.
Something simpler would be to just count the different words in each response and sort by frequency. If words like the following are not at the top then it's probably not valid text:
the, a, in, of, and, or, ...
They are the most frequently used words in any usual English text.
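A minimal Python sketch of that word-frequency check. The word list, the function name `looks_like_english` and the `top_n` cutoff are assumptions made for illustration; very short messages may contain none of these words, so treat the result as a hint rather than a verdict.

```python
import re
from collections import Counter

# Assumed list of very common English words; extend it for real use.
COMMON_WORDS = {"the", "a", "in", "of", "and", "or", "to", "is", "it", "i"}

def looks_like_english(text, top_n=5):
    """Return True if a common word is among the top_n most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    # most_common() sorts by count, highest first.
    top = [word for word, _ in Counter(words).most_common(top_n)]
    return any(word in COMMON_WORDS for word in top)
```

For example, `looks_like_english("asdf qwer zxcv asdf")` returns `False` because none of its words are in the list.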
Upvotes: 0
Reputation: 202475
Fidelis Assis and I have been adapting the spam filter OSBF-Lua so that it can easily be adapted to other applications including web applications. This spam filter won the TREC spam contest three years running. (I don't mind bragging because the algorithm is Fidelis's, not mine.)
If you want to try things out, we have "nearly beta" code at
git clone http://www.cs.tufts.edu/~nr/osbf-lua-temp
We are still a long way from having a tidy release, but the code should build provided you install automake 1.9. Either of us would be happy to advise you on how to use it to clean your database and to integrate it into your application.
Upvotes: 2
Reputation: 5870
The preceding answers about setting up a Bayesian-inspired spam-filter classifier are a good idea. For your application, since you seem to get a lot of long nonsense words, it would be best to turn on an option in your parser to train on bigrams and trigrams; otherwise, many of the nonsense words will just be treated as "never seen before", which is not the most useful parse in your case.
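To make the idea concrete, here is a rough Python sketch of a naive Bayes classifier trained on bigrams and trigrams. It assumes character n-grams (the answer does not say whether word or character n-grams are meant), and the class name `NgramBayes` and the labels "ham" and "junk" are made up for illustration. It is not how any particular spam filter is implemented.

```python
import math
from collections import Counter

def char_ngrams(text, sizes=(2, 3)):
    """Yield character bigrams and trigrams for every word in the text."""
    for word in text.lower().split():
        padded = f"^{word}$"  # mark word start and end
        for n in sizes:
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

class NgramBayes:
    """Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self):
        self.counts = {"ham": Counter(), "junk": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.counts[label].update(char_ngrams(text))
        self.docs[label] += 1

    def score(self, text, label):
        # Requires at least one training message per label.
        counts = self.counts[label]
        total = sum(counts.values())
        vocab = len(set(self.counts["ham"]) | set(self.counts["junk"])) + 1
        logp = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in char_ngrams(text):
            logp += math.log((counts[gram] + 1) / (total + vocab))
        return logp

    def classify(self, text):
        return max(("ham", "junk"), key=lambda label: self.score(text, label))
```

Because the features are pieces of words, a nonsense word such as "asdasd" still shares n-grams with earlier junk instead of counting as a single unknown token.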
Upvotes: 0
Reputation: 30933
Look up Claude Shannon and Markov models. These lead to a statistical technique for assessing probabilities that letter combinations come from a specified language source.
Here are some relevant course notes from Princeton University.
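As a rough illustration of the idea, the Python sketch below builds a first-order Markov model over letters and scores a message by its average log-probability per letter transition. The tiny `SAMPLE_CORPUS` and all function names are assumptions; a usable model would need a much larger body of English text and a threshold tuned on real feedback.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny stand-in corpus (an assumption); real use needs far more text.
SAMPLE_CORPUS = (
    "the program crashes when i try to save a file "
    "please add an option to print the report as a table "
    "thanks for the great tool it works well on my computer"
)

def normalize(text):
    """Lowercase the text and drop everything except letters and spaces."""
    return "".join(c for c in text.lower() if c in ALPHABET)

def train(corpus):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    text = normalize(corpus)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def avg_log_prob(text, counts):
    """Average log-probability per transition, with add-one smoothing."""
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        p = (row.get(b, 0) + 1) / (sum(row.values()) + len(ALPHABET))
        total += math.log(p)
    return total / len(pairs)
```

Usage: build the model once with `model = train(SAMPLE_CORPUS)`, then compare `avg_log_prob(message, model)` against a cutoff chosen from known good and bad messages. Lower scores mean the letter sequence is less typical of the training text.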
Upvotes: 2
Reputation: 58911
The simplest method would be to count the occurrences of each letter. E is the most common letter in English, so it should be used the most. You could also check word and digraph frequencies. Have a look here for lists of the most frequently used letters, digraphs and words in English.
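A small Python sketch of a letter-count check along those lines. Instead of looking only at E, it measures the share of letters that belong to a set of commonly cited frequent English letters; the set `COMMON_LETTERS` and the function name are assumptions, and any cutoff has to be tuned on real messages.

```python
from collections import Counter

# Commonly cited as the most frequent letters in English text (assumed set).
COMMON_LETTERS = set("etaoinshr")

def common_letter_share(text):
    """Return the fraction of letters that are among the common letters."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    common = sum(n for letter, n in counts.items() if letter in COMMON_LETTERS)
    return common / len(letters)
```

Ordinary English sentences should score clearly higher than random key mashing, although home-row mashing such as "asdf" contains A and S and so scores higher than one might expect.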
Upvotes: 3
Reputation: 60564
I had a spamming problem in a guestbook function on one of my sites quite a long while ago. My solution was simply to add a little captcha-like Q&A field asking the user "Are you a spamming robot?" Any answer containing the word "no" (letting through "no, i'm not", "nope" and "not at all" too, just for fun...) permitted the user to post...
The reason I chose not to use captcha was simply that my users wanted a more "cozy" feel to the site, and a captcha felt too formal. This was more personal =)
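The check described above could look roughly like this in Python. The function name is made up, and a plain substring match is an assumption based on "nope" and "not at all" being accepted.

```python
def passes_robot_question(answer):
    """Accept any answer that contains "no", ignoring case."""
    return "no" in answer.lower()
```

A substring match also lets through answers such as "I don't know", which is in keeping with the relaxed spirit of the question.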
Upvotes: 5
Reputation: 47452
A slightly different approach would be to set up a system to email the feedback messages to an account and use standard spam filtering. You could send them through gmail and let their filtering take a shot at it. Not perfect, but not too much effort to implement either.
Upvotes: 12
Reputation: 7327
If you're only expecting (or care about) English comments, then why not simply count the number of valid words (with respect to some dictionary) in the feedback uploaded. If the number passes some threshold, accept the feedback. If not, trash it. This simple heuristic could be extended to other languages by adding their dictionaries.
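A minimal Python sketch of this heuristic. It uses the share of valid words rather than a raw count so that long and short messages are treated alike; the tiny dictionary, the 0.6 threshold and the function names are assumptions. In practice the dictionary would be loaded from a full word list.

```python
import re

# Placeholder dictionary (an assumption); load a real word list in practice.
TINY_DICTIONARY = {
    "the", "program", "crashes", "when", "i", "save", "a", "file",
    "please", "add", "an", "option",
}

def valid_word_ratio(text, dictionary):
    """Return the fraction of words in the text found in the dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)

def accept(text, dictionary, threshold=0.6):
    """Accept the feedback if enough of its words are valid."""
    return valid_word_ratio(text, dictionary) >= threshold
```

Supporting another language then only means passing a different (or merged) dictionary set.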
Upvotes: 6
Reputation: 17307
How about just using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.
Upvotes: 12