b8: statistical discussion

Index

  1. Why this?
  2. Way of proceeding
  3. Statistical analysis
    1. "Sharp" ratings vs. Gary Robinson's approach
      1. Comparison of the "sharp" rating method and the continuous rating
      2. Closer inspection: Two-sided Wilcoxon rank sum test
    2. Using a minimum deviation
      1. Results using a minimum deviation of 0.2
      2. Results using a minimum deviation of 0.4
      3. Visualization of the data
        1. Results for a minimum deviation of 0.2
        2. Results for a minimum deviation of 0.4
      4. Overview of b8's performance
    3. Inspection of the median and mean ratings
  4. Conclusions

Why this?

While programming version 0.4 of b8, I decided to make its probability calculation configurable. That made me wonder which settings would give the best performance, so I ran some tests to figure out which default values should be set in the config files.
I'm not a statistician – so if you know better, please tell me ;-) Anyway – if you find that other values work better than the "best" values I found, please contact me as well.

Here's what I found.

Way of proceeding

I took the wordlist from the b8 installation on my homepage, which has been filled with 103 ham and 266 spam texts, resulting in 2523 ham and 5120 spam tokens (333 of them appearing in both categories). Further, I took a corpus of spam and ham texts collected by Johannes Rosa, with the guestbook entries of http://www.abi2002amschiller.de/ added to the ham corpus.

Those 350 spam and 207 ham texts were rated using the wordlist, making sure that the wordlist had not been built from these texts. This way, none of the texts had been seen by b8 before, corresponding to a real-world situation.
As on my homepage, a spam cutoff value of 0.7 was used to discriminate "ham" from "spam".

Statistical analysis

"Sharp" ratings vs. Gary Robinson's approach

In [1], Paul Graham proposes a "sharp" rating: 0.9998 or 0.9999 for tokens seen only in spam texts, and 0.0002 or 0.0001 for tokens seen only in ham texts. This has been b8's default behaviour since version 0.2. Gary Robinson's calculation method for single-token ratings, proposed in [2], has been used since version 0.1 to rate tokens seen in both ham and spam.
The first thing to figure out was whether this is meaningful. As the wordlist shares only a small number of tokens between ham and spam, one might assume that Robinson's approach has no big impact on the results compared to the "traditional" sharp rating method when calculating the texts' spamminess.
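
To make clear what the s parameter examined below actually does, here is a small Python sketch of both per-token rating methods as I understand them from [1] and [2]. It is a simplified sketch: the variable names are mine, and b8's actual implementation (written in PHP) may differ in details, e.g. in how the plain spam probability is derived from the corpus sizes.

  def robinson_rating(n_ham, n_spam, s=0.3, x=0.5):
      # Gary Robinson's degree-of-belief rating [2]:
      #   f(w) = (s*x + n*p(w)) / (s + n)
      # n: number of texts the token was seen in, p(w): its plain
      # spam probability, x: the assumed rating for unknown tokens,
      # s: the strength given to that assumption.
      n = n_ham + n_spam
      p = n_spam / n if n > 0 else x
      return (s * x + n * p) / (s + n)

  def sharp_rating(n_ham, n_spam):
      # Paul Graham's "sharp" rating [1] for single-category tokens;
      # tokens seen in both categories keep Robinson's rating, as
      # b8 has done since version 0.1.
      if n_spam > 0 and n_ham == 0:
          return 0.9999   # token seen only in spam
      if n_ham > 0 and n_spam == 0:
          return 0.0001   # token seen only in ham
      return robinson_rating(n_ham, n_spam, s=1, x=0.5)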

First, the whole ham and spam corpus was rated using the wordlist. Then, the false negatives and false positives were counted, and the Sensitivity, Specificity, Positive Predictive Value and Negative Predictive Value were calculated.
No minimum deviation was set, matching the default values b8 has used so far. For the "sharp" rating, s was set to 1, and 0.5 was used as the default rating for an unknown token.
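
To make the four measures concrete, here is a minimal Python sketch of the evaluation, using the spam cutoff of 0.7 described above (variable names are mine):

  def confusion(spam_ratings, ham_ratings, cutoff=0.7):
      # count the classification outcomes for both corpora
      tp = sum(r > cutoff for r in spam_ratings)   # spam rated "spam"
      fn = len(spam_ratings) - tp                  # false negatives
      tn = sum(r <= cutoff for r in ham_ratings)   # ham rated "ham"
      fp = len(ham_ratings) - tn                   # false positives
      return tp, fp, tn, fn

  def measures(tp, fp, tn, fn):
      return {
          "Sensitivity": tp / (tp + fn),                # share of spam recognized
          "Specificity": tn / (tn + fp),                # share of ham recognized
          "Positive Predictive Value": tp / (tp + fp),  # "spam" verdicts correct
          "Negative Predictive Value": tn / (tn + fn),  # "ham" verdicts correct
      }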

Comparison of the "sharp" rating method and the continuous rating

Text classification according to b8's rating (percent; "sharp" rating vs. continuous rating with s from 1 down to 0.01; adjacent s values whose results were identical share one column):

                              "sharp"     s = 1 … 0.01
  Spam                         96.57      99.14   98.57   98.29
  False negative                3.43       0.86    1.43    1.71
  Ham                         100        100     100     100
  False positive                0          0       0       0

Cross tabulation of the results:

                              "sharp"     s = 1 … 0.01
  Sensitivity                  96.57      99.14   98.57   98.29
  Specificity                 100        100     100     100
  Positive Predictive Value   100        100     100     100
  Negative Predictive Value    94.52      98.57   97.64   97.18

So, Robinson's approach seems to deliver the better results, independent of the value of s. Here's a boxplot of all resulting ratings for the spam and the ham corpus:

"Sharp" rating vs. continious rating – spam corpus "Sharp" rating vs. continious rating – ham corpus

Closer inspection: Two-sided Wilcoxon rank sum test

All ratings using the continuous calculation with an s value from 1 down to 0.03 were significantly higher for the spam corpus than those of the traditional "sharp" rating method, whereas the continuous ratings with s values greater than 0.03 showed no significantly higher ratings for the ham corpus. None of the methods showed a significantly lower rating for ham than another.

Here we can say that Robinson's approach outperforms the method used in b8 up to version 0.3.3, but no significant differences were found between the continuous rating method's results depending on the s value. Further tests using only this method showed different behaviour depending on the minimum deviation setting anyway.
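
Such a comparison can be reproduced with SciPy, whose ranksums function implements exactly this two-sided Wilcoxon rank sum test (a sketch; the rating lists are hypothetical):

  from scipy.stats import ranksums

  def differ_significantly(ratings_a, ratings_b, alpha=0.05):
      # two-sided Wilcoxon rank sum test: do the rating distributions
      # of two settings differ significantly?
      statistic, p_value = ranksums(ratings_a, ratings_b)
      return p_value < alpha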

Using a minimum deviation

Results using a minimum deviation of 0.2

It seems meaningful to require a minimum deviation from 0.5 that a token's rating must have to be considered in the spamminess calculation. Setting the minimum deviation to 0.2 even increased the share of spam texts detected, still with no false positives (a sketch of the criterion follows the tables):

Text classification according to b8's rating (percent; continuous rating with s from 1 down to 0.01; adjacent s values whose results were identical share one column):

                              s = 1 … 0.01
  Spam                         99.14   99.43   99.43   98.86   98.57   98.29
  False negative                0.86    0.57    0.57    1.14    1.43    1.71
  Ham                          99.52   99.52  100     100     100     100
  False positive                0.48    0.48    0       0       0       0

Cross tabulation of the results:

                              s = 1 … 0.01
  Sensitivity                  99.14   99.43   99.43   98.86   98.57   98.29
  Specificity                  99.52   99.52  100     100     100     100
  Positive Predictive Value    99.71   99.71  100     100     100     100
  Negative Predictive Value    98.56   99.04   99.04   98.10   97.64   97.18
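
The criterion itself is simple: a token only enters the spamminess calculation if its rating deviates from the neutral 0.5 by at least the minimum deviation. A minimal sketch, assuming the per-token ratings have already been calculated (b8's internals may differ in details):

  def relevant_ratings(token_ratings, min_dev=0.2):
      # keep only tokens whose rating deviates from the neutral
      # value 0.5 by at least min_dev
      return [r for r in token_ratings if abs(r - 0.5) >= min_dev]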

Results using a minimum deviation of 0.4

With the minimum deviation set to 0.4, b8's Sensitivity increased even more, but at the cost of false positives in each case (except for s=0.3, which is very likely a coincidence). So I consider such a high minimum deviation unacceptable, as false positives are considerably less tolerable than false negatives.

Text classification according to b8's rating (percent; continuous rating with s from 1 down to 0.01; adjacent s values whose results were identical share one column):

                              s = 1 … 0.01
  Spam                         99.14   99.43   99.14  100     98.57   98.29   98
  False negative                0.86    0.57    0.86    0      1.43    1.71    2
  Ham                          98.07   97.58   97.58   98.07   99.52   99.52   99.52
  False positive                1.93    2.42    2.42    1.93    0.48    0.48    0.48

Cross tabulation of the results:

                              s = 1 … 0.01
  Sensitivity                  99.14   99.43   99.14  100     98.57   98.29   98
  Specificity                  98.07   97.58   97.58   98.07   99.52   99.52   99.52
  Positive Predictive Value    98.86   98.58   98.58   98.87   99.71   99.71   99.71
  Negative Predictive Value    98.54   99.02   98.54  100     97.63   97.17   96.71

Visualization of the data

The boxplots below show the ratings of the spam and the ham corpus for each s value, depending on the minimum deviation used.

Results for a minimum deviation of 0.2

Results for a minimum deviation of 0.2 – spam Results for a minimum deviation of 0.2 – ham

Results for a minimum deviation of 0.4

Results for a minimum deviation of 0.4 – spam Results for a minimum deviation of 0.4 – ham

The situation is the same as with the ratings where no minimum deviation was used: very small s values (0.03 and 0.01) showed significantly higher ratings for the ham corpus, with no significant difference from greater s values in the ratings of the spam corpus.
The ratings with a minimum deviation of 0.4 show a step in the ham ratings between s=0.5 and s=0.3, with a quite low median of the ham ratings before it; but this setting is not acceptable due to the occurrence of false positives.

The ratings using a minimum deviation of 0.4 apparently produce a tighter box, but no significant difference from the rating distributions with a minimum deviation of 0.2 was found (probably due to more outliers).

Overview of b8's performance

The following pictures visualize the rate of correctly rated texts depending on the s value used and the minimum deviation set.

Comparing b8's performance using a minimum deviation of 0 and of 0.2: both settings show the best performance with an s value of 0.3. Using 0.2 as the minimum deviation increases the number of identified spam messages, still without false positives.

b8 performance overview – minimum deviation 0 b8 performance overview – minimum deviation 0.2

Using the higher minimum deviation of 0.4 results in false positives regardless of the s value, while the sensitivity nevertheless does not increase much:

b8 performance overview – minimum deviation 0.4

Inspection of the median and mean ratings

The following pictures illustrate the median (solid line) and mean (dashed line) ratings of b8 depending on the s and minimum deviation settings, separately for the spam and the ham corpus.

b8's ratings: median and mean – spam b8's ratings: median and mean – ham

Apart from the step in the ham ratings between s=0.5 and s=0.3 with a minimum deviation of 0.4 described above, the ham ratings seem to be quite constant with few outliers (as the median and mean values resemble each other), independent of the settings used. The same should hold for b8's performance in identifying ham texts.

The median of the spam ratings seems to be quite constant for s values below 0.5. The mean rating has a maximum at s=0.3, as assumed. The higher the minimum deviation is set, the higher the mean values get, meaning that more outliers are caught.

Conclusions

So, I think not using the "sharp" ratings and setting s to 0.3 and minDev to 0.2 will give quite good results :-) Let me know if you know better, and have a lot of fun using b8!

References

  1. Paul Graham: Better Bayesian Filtering
  2. Gary Robinson: Spam Detection
Tobias Leupold (tobias . leupold at web . de)
http://nasauber.de/