We are introducing some new anti-spam technology, which significantly improves the spam detection capabilities of M-Switch Anti-Spam. This note gives a simple explanation of what we have done to achieve these improvements. First, a quick reminder of the two key metrics for measuring spam detection:
- False positive rate: The percentage of real messages that are incorrectly classified as spam, and blocked, marked or quarantined as such.
- False negative rate: The percentage of spam that gets through.
Our most important mechanism for spam detection is content analysis. Isode's basic strategy for content analysis is to provide effective generic recognition of spam versus real messages. This results in a system that does not need frequent updates and is effective for a wide range of situations. This contrasts with other products we have looked at, which include:
- Systems that keep false negatives low by accepting a relatively high false positive rate (we have seen arguments that a 1% or greater false positive rate is acceptable, which we strongly disagree with).
- Systems that require extensive use of white lists and black lists to be effective (as the base system is not good enough).
- Systems that need to monitor very large quantities of spam and have frequent data set updates (as they work by matching specific instances of spam, rather than generic spam characteristics).
- Systems with matching rules set by humans. As well as being resource intensive, this leads to false positives, as humans do not easily recognize that a characteristic they use to identify a piece of spam is not uncommon in real messages.
Our current content filtering works by a technique known as Bayesian analysis. Each message is examined for a number of characteristics, which include the presence of specific words and "spam characteristics" (e.g., a message date significantly in the past). Based on a database of real messages and spam, each of these words and characteristics is weighted. The scores of all matched features and words are added together to give a total score, and the message is accepted or rejected on the basis of this score. This approach is quite commonly used by anti-spam vendors; Isode's implementation is characterized by a very high performance engine and carefully built data sets.
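The weighting step can be sketched as follows. This is a minimal illustration, not Isode's actual implementation: the toy corpus, the smoothing, and the log-odds formula are all invented for the example; the idea is simply that a token common in spam and rare in real mail gets a positive weight, and vice versa.

```python
import math

# Toy labelled corpus: (tokens, is_spam). Purely illustrative data.
corpus = [
    (["cheap", "pills", "offer"], True),
    (["offer", "expires", "today"], True),
    (["meeting", "agenda", "today"], False),
    (["project", "meeting", "notes"], False),
]

def train_weights(corpus, smoothing=1.0):
    """Assign each token a log-odds weight: positive if it is more
    common in spam, negative if it is more common in real messages."""
    spam_counts, ham_counts = {}, {}
    n_spam = sum(1 for _, is_spam in corpus if is_spam)
    n_ham = len(corpus) - n_spam
    for tokens, is_spam in corpus:
        bucket = spam_counts if is_spam else ham_counts
        for t in set(tokens):
            bucket[t] = bucket.get(t, 0) + 1
    weights = {}
    for t in set(spam_counts) | set(ham_counts):
        p_spam = (spam_counts.get(t, 0) + smoothing) / (n_spam + 2 * smoothing)
        p_ham = (ham_counts.get(t, 0) + smoothing) / (n_ham + 2 * smoothing)
        weights[t] = math.log(p_spam / p_ham)
    return weights

weights = train_weights(corpus)
# "pills" appears only in spam, so its weight is positive;
# "meeting" appears only in real mail, so its weight is negative;
# "today" appears equally in both, so its weight is zero.
```

Each incoming message is then scored by summing the weights of the tokens it matches, which is the "total score" the note describes.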
This approach has proved effective:
- It produces a very low level (0.1%) of false positives (real messages that are interpreted as spam, and then quarantined or deleted).
- For this low level of false positives, it produces a quite low level of false negatives (5%).
- Data sets have proved stable, and given consistent results without frequent updating.
Over the last two years, we have noticed three things:
- There has been a slight increase in the false negative rate, mainly due to spammers working to avoid content filters such as ours.
- There has been a significant increase in the absolute level of spam, which means that heavily spammed mailboxes see enough spam getting through (false negatives) to be irritating (despite most of it being trapped).
- The use of techniques by spammers to counter Bayesian filters has made it hard to improve performance of data sets.
Because of this we are introducing a new content filtering system. This uses a technique called the Support Vector Machine. This technique is used by some other anti-spam vendors, but is much less common than Bayesian analysis. The mathematics behind Support Vector Machines is somewhat intimidating. Those interested are referred to a tutorial by Christopher Burges, which gives an introduction to the mathematics and pointers to the literature on this technique.
The Isode Support Vector Machine approach uses the same basic inputs as our Bayesian system:
- 124 "spam characteristics"
- 48,000 words (derived from our spam and message samples, eliminating common words, rare words, and some other words)
We've also used a Support Vector Machine in a way that does not require changes to our spam checking, and so we can provide the new features without a product update. Essentially, when a message is checked, a positive or negative value is associated with each word or spam characteristic matched. These numbers are then combined to give a total (we do something a little more complex than just adding the numbers). This spam score then controls processing.
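As an illustration of the check-time scoring just described, the sketch below combines per-feature values into a single spam score. The weight values, feature names, bias, and threshold are invented for the example, and the plain summation is a simplification: as noted above, the real combination is a little more complex than just adding the numbers.

```python
# Hypothetical per-feature weights as produced by training (values invented).
# Keys mix words and "spam characteristics" such as a date in the past.
weights = {"cheap": 1.9, "pills": 2.3, "meeting": -1.7, "DATE_IN_PAST": 1.2}
bias = -0.5       # illustrative offset
threshold = 0.0   # messages scoring above this are treated as spam

def spam_score(matched_features):
    """Combine the positive or negative value of every matched word or
    spam characteristic into one total score (simplified to a sum here)."""
    return bias + sum(weights.get(f, 0.0) for f in matched_features)

# A message matching spammy features scores high; a real message scores low.
is_spam = spam_score(["cheap", "pills", "DATE_IN_PAST"]) > threshold
```

Because only the data set of weights changes between the Bayesian and Support Vector Machine approaches, the same scoring engine can run either, which is why no product update is needed.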
In Bayesian analysis, a simple probabilistic approach is taken. A sample set of messages and spam is analyzed, each input (word or spam feature) being checked is counted, and a weight is assigned to each input. An input that often occurs in spam and rarely in real messages is given a high weight. The difference with Support Vector Machines is the mechanism used to derive these weights. Isode uses a Support Vector Machine to generate weights, looking at the inputs in combination. It effectively allows the question to be asked: "for a given set of inputs, which set of weights will most effectively separate spam from real messages (based on a sample set of spam and messages)?". This leads to a (very) computationally expensive calculation to determine an optimum set of weights.
It is hard to explain this in very simple terms, but the key difference from the Bayesian approach is that the analysis takes into account the relationships between the inputs (words and spam characteristics) and how they occur together in spam, rather than treating each input in isolation.
While the mathematics behind Support Vector Machines is complex, the basic advantage of this approach is that it enables efficient comparison against a combination of words and spam factors, unlike the Bayesian approach, which treats each factor as independent.
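The weight-finding step can be sketched as a tiny linear SVM trained by subgradient descent on the hinge loss. Everything here is invented for illustration (the four-point data set, the learning rate, the training loop) and is far simpler than a production solver, but it shows the core idea: all the weights are adjusted together until they jointly separate the two classes.

```python
# Toy feature vectors (1.0 = input matched) and labels (+1 spam, -1 real).
# Invented data; real inputs would be the ~48,000 words and 124 spam
# characteristics described above.
X = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y = [+1, +1, -1, -1]

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimise the regularised hinge loss by subgradient descent,
    yielding one weight per input plus a bias: a set of weights chosen
    in combination to separate spam from real messages."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin: push it to the right side
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # correctly classified with margin: only decay weights
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

w, b = train_linear_svm(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
# Each training example should now fall on the correct side of zero.
```

The expensive part is this training loop over a large sample set; once the weights exist, scoring a message is just the cheap weighted sum, which is why the data set generation is a one-off compute-intensive process while checking remains fast.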
The support vector comparison is implemented by Isode in a way which gives very fast spam checking. We are able to provide Support Vector Machine functionality as a data set update to our current spam comparison engine, which gives a clear demonstration of the flexibility of the M-Switch Anti-Spam engine. (The generation of the data set is VERY compute intensive, but this is a one-off process run by Isode).
The measured results for the data set from internal testing are a 0.02% false positive rate with a 2% false negative rate. This is for content detection alone, so the overall performance of the complete Isode anti-spam solution should be further improved by the other anti-spam techniques. The threshold can be changed to decrease false positives (leading to a corresponding increase in false negatives) or vice versa.
It's useful to consider what this means in real terms. My mailbox is heavily spammed (about 200 messages per day). A 2% false negative rate means that typically 196 of the spams get caught and 4 get through. False positives are more complex, as they depend on a user's real traffic. A 0.02% false positive rate would mean that one message in 5,000 gets classified as spam (one message every two months or so if you get 100 messages per day). In practice, some users will never see false positives, and some users who get "spammier" messages will see more. Deployment for Isode staff use is giving performance in line with the testing numbers.