Out now: b8 0.6
I just uploaded the new major release of my statistical ("Bayesian") PHP spam filter b8. A lot of work has been done and there are a lot of changes. If you experience any problem with the database update or any bug in general, please contact me!
From the ChangeLog, with comments:
Changes
- Finally completed the abstraction of the storage backends. Each backend is now free to store b8's wordlist however it wants. This made it possible to change the MySQL database layout so that the data is stored in multiple columns, rather than emulating the Berkeley DB behaviour.
- Removed the never-used lastseen parameter. The wordlist now wastes less space, and classifying a text no longer triggers any write actions. Data is now only written to the database when learning or deleting a text.
- Renamed the internal variables to b8*..., and combined bayes*texts.ham and bayes*texts.spam into b8*texts. This is simply more consistent, and it was a good opportunity to do so, as an update of the database structure was necessary anyway (update scripts are included in the new release).
- Removed all validate() functions in favor of throwing exceptions when something's wrong. This way, b8 finally behaves like I wanted it to from the start: when something's wrong, no instance of b8 is created at all. This was not possible back in the PHP 4 days.
- Made the lexer more flexible. Each splitting step now has its own function, and all of them except the raw split can be turned on and off via a new config array.
- The lexer now supports extracting BBCode.
- Added an additional check to the lexer to be sure no token will collide with an internal variable.
- Added multibyte support to the degenerator, so it is now able to handle non-Latin-1 texts in the same way as it handles Latin-1 texts. The difference between using and not using multibyte operations only shows up when non-Latin-1 text is processed by b8. For example, if we have an unknown token HeLlO!, the degenerator will provide the degenerated versions hello!, HELLO!, Hello!, hello, HELLO, Hello and HeLlO, no matter whether multibyte operations are used or not. With a non-Latin-1 word, we may get a different result. For example, if we have the unknown token ПрИвЕт!, the degenerator will provide only one degenerated version of it when not using multibyte operations: ПрИвЕт. Using multibyte operations, we get the same variants as with the Latin-1 word: привет!, ПРИВЕТ!, Привет!, привет, ПРИВЕТ, Привет and ПрИвЕт.
- b8's constructor now takes four config arrays: the third is the lexer config and the fourth is the degenerator config.
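To make the degenerator example above more concrete, here is a minimal sketch of the idea. It is written in Python rather than PHP to keep it short, and it is not b8's actual code: the names degenerate, _bytewise_lower and _bytewise_upper are made up for this illustration, only trailing "!", "?" and "." are stripped, and the byte-wise helpers merely approximate what non-multibyte string functions do to UTF-8 text (they only touch ASCII letters).

```python
def _bytewise_lower(text):
    # Assumption: approximates non-multibyte lowercasing on UTF-8 data.
    # bytes.lower() only changes ASCII letters, so Cyrillic stays as it is.
    return text.encode("utf-8").lower().decode("utf-8")


def _bytewise_upper(text):
    # Same idea for uppercasing: only ASCII letters are affected.
    return text.encode("utf-8").upper().decode("utf-8")


def degenerate(token, multibyte=True):
    """Return the degenerated variants of token (hypothetical helper)."""
    if multibyte:
        lower, upper = str.lower, str.upper
    else:
        lower, upper = _bytewise_lower, _bytewise_upper

    def first_upper(text):
        # First letter upper case, rest lower case.
        low = lower(text)
        return upper(low[:1]) + low[1:]

    # Work on the token itself and on the token without trailing punctuation.
    bases = [token]
    stripped = token.rstrip("!?.")
    if stripped and stripped != token:
        bases.append(stripped)

    variants = []
    for base in bases:
        for candidate in (lower(base), upper(base), first_upper(base), base):
            # Skip the original token and anything we already have.
            if candidate != token and candidate not in variants:
                variants.append(candidate)
    return variants
```

Under these assumptions, degenerate("HeLlO!") yields the same seven variants with or without multibyte operations, while degenerate("ПрИвЕт!", multibyte=False) only yields ПрИвЕт.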
Bugfixes
- Removed the ucfirst function from the degenerator and replaced it with a custom one. It did not do what I always thought it would do (first letter upper case, rest lower case), but only converted the first letter to upper case.
- Fixed the MySQL backend so it's now able to handle a get() request for an empty array or an array containing just one token.
- Fixed the MySQL backend's handling of queries that return no result.
- Fixed the lexer so it never outputs an empty array of tokens; if no token has been found, it outputs a placeholder token instead.
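The two lexer safeguards mentioned above (no token may collide with an internal variable, and the token list is never empty) can be sketched roughly like this. Again, this is a Python illustration and not b8's actual lexer: the function tokenize, the minimum length, the placeholder name b8*no_tokens and the choice to simply drop colliding tokens are all assumptions made for the example. Only the b8* prefix of the internal variables is taken from the changelog.

```python
import re

INTERNAL_PREFIX = "b8*"             # internal variables are named b8*...
PLACEHOLDER_TOKEN = "b8*no_tokens"  # hypothetical placeholder name


def tokenize(text, min_length=3):
    """Split text into tokens; never return an empty list (sketch)."""
    tokens = []
    for raw in re.split(r"\s+", text):
        # Strip surrounding punctuation from the raw split result.
        word = raw.strip(".,;:!?\"'()")
        if len(word) < min_length:
            continue
        # Assumption: a token that looks like an internal variable is dropped.
        if word.startswith(INTERNAL_PREFIX):
            continue
        tokens.append(word)
    # If nothing usable was found, hand back a placeholder instead of [].
    return tokens or [PLACEHOLDER_TOKEN]
```

With this sketch, tokenize("") returns the placeholder token rather than an empty list, and a word like b8*texts in the input never ends up in the token list.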
Have a lot of fun with b8 :-)