Comments working again
I finally wrote the spam filter that I had promised a while back. It was less work than I thought, actually. Anyway, you can now write comments again.
The filter is a so-called Naive Bayes filter. It calculates the probability that a comment is spam, based on how often the words in the comment were observed in spam comments and in normal comments. The implementation generally follows the English Wikipedia article about this, without any additional heuristics for rare words and the like.
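To make the idea a bit more concrete, here is a minimal sketch of the combining step, assuming each word already has its own spam probability. This is not the code that runs on this site; the function name `combine` is made up for illustration, and the formula is the usual naive Bayes combination p = p1·…·pn / (p1·…·pn + (1−p1)·…·(1−pn)).

```python
import math

def combine(word_probs):
    """Combine per-word spam probabilities into one probability for the comment.

    Hypothetical helper: word_probs is a list of values strictly between 0 and 1,
    one per word in the comment.
    """
    spam = math.prod(word_probs)                   # evidence that it is spam
    ham = math.prod(1.0 - p for p in word_probs)   # evidence that it is not
    # Note: if word_probs contains both an exact 0 and an exact 1,
    # spam + ham is zero and this division fails.
    return spam / (spam + ham)

# Two fairly "spammy" words push the result above either single value.
print(combine([0.9, 0.8]))
```

For comments with a lot of words the products get very small, so a real implementation would probably sum logarithms instead of multiplying directly.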
If anyone cares, I can post the code for you all to read. It isn’t that much. The most important thing I noticed was that the spam filter might go crazy if it finds a word that was never seen either as spam or as not-spam, the so-called zero-frequency problem: a count of zero turns into a probability of exactly 0 or 1, or even into a division by zero. To solve that, whenever I add a new word, I first set both its sightings as spam and its sightings as not-spam to one (and then add one more for whatever I actually saw it as). This makes the results slightly less accurate, but it remains good enough to work.
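The trick with the counters can be sketched roughly like this. It is an illustration and not the actual code of this site: the class `WordCounts` and its methods are invented names, and the per-word probability is simplified to the plain ratio of sightings, which assumes that about as many spam as normal comments were used for training.

```python
class WordCounts:
    """Hypothetical store of how often each word was seen as spam / not-spam."""

    def __init__(self):
        self.counts = {}  # word -> [sightings as spam, sightings as not-spam]

    def train(self, words, is_spam):
        for word in words:
            if word not in self.counts:
                # New word: start both counters at one so neither is ever zero.
                self.counts[word] = [1, 1]
            # Then add one more for whatever the word was actually seen as.
            self.counts[word][0 if is_spam else 1] += 1

    def word_probability(self, word):
        # Words never seen in training are treated like a fresh [1, 1] entry,
        # which gives a neutral 0.5 (an assumption of this sketch).
        spam, ham = self.counts.get(word, (1, 1))
        return spam / (spam + ham)

stats = WordCounts()
stats.train(["cheap", "pills"], is_spam=True)
stats.train(["nice", "post"], is_spam=False)
print(stats.word_probability("cheap"), stats.word_probability("nice"))
```

Because both counters start at one, a word seen once as spam ends up at 2/3 instead of a hard 1, so a single word can never decide the result on its own.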
Currently the filter has three levels. If the probability that a comment is spam is higher than 95%, the comment isn’t even written to the database, but rejected immediately. A comment with a spam probability of more than 70% is saved, but remains hidden until I’ve decided whether it is spam or not. Anything below that shows up right away. Every time I make such a decision, the spam filter gets trained a little bit to become more accurate. Of course, I may have to change these thresholds in the future.
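As a sketch, the three levels boil down to two comparisons. The 95% and 70% values are the ones mentioned above; the constant names, the function name and the three labels are made up for this example, and the "publish" level for everything else is assumed here.

```python
REJECT_THRESHOLD = 0.95  # above this: never written to the database
HOLD_THRESHOLD = 0.70    # above this: saved, but hidden until reviewed

def decide(spam_probability):
    """Map a spam probability (0.0 to 1.0) to one of three hypothetical actions."""
    if spam_probability > REJECT_THRESHOLD:
        return "reject"
    if spam_probability > HOLD_THRESHOLD:
        return "hold"
    return "publish"

for p in (0.99, 0.80, 0.10):
    print(p, decide(p))
```

Keeping the thresholds as named constants makes it easy to adjust them later without touching the rest of the logic.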
Written on July 4th, 2010 at 01:36 am
