nasauber.de

Blog: Entries 02.05.2019–18.03.2020

b8 0.7 out now

My oldest still-maintained project, b8 (the statistical PHP spam filter), got an overall code refactoring and modernization. After all these years, this really was necessary!

Here's what has changed, as listed under "Update from prior versions" in the readme:

Overall code rework

The code has been modernized a lot since the last release. Most notably, namespaces have been added. So, you have to instantiate b8 e. g. like this now:

$b8 = new b8\b8(...);

To use the constants, please also add the namespace, e. g. b8\b8::HAM.

Due to the namespace introduction, the default degenerator and lexer can't be called default anymore. The name is now standard (e. g. b8\lexer\standard).

Storage backend approach change

The storage backends now leave the connection to a database to the user (where it belongs). The Berkeley DB (DBA) storage backend remains the reference one. The other remaining one shows how to store b8's wordlist in a MySQL table, and is meant more as an example of how to implement a proper storage backend. The base storage class now has all needed functions added as abstract definitions, so that everybody can easily implement the backend they need. Also, some function names have been changed to more meaningful ones.

The DBA backend now simply wants to have a working DBA resource as the only parameter. So if you use this, you would do e. g.:

$db = dba_open('wordlist.db', 'w', 'db4');
$config_dba = [ 'resource' => $db ];

and pass this to b8.

The (example) MySQL backend takes a mysqli object and a table name as config keys. Simply look at the backends themselves to see the changes.

If you implemented your own backend, you will have to update it. But this should be quite straightforward.

Please note the newly added start_transaction() function. Back when b8 was started, MySQL's MyISAM engine was the default, and it didn't even support transactions (man, this project is actually quite old ;-)!

Additionally, the PostgreSQL backend and the original MySQL backend (using the long-deprecated mysql functions, not the mysqli ones) have been removed.

New default configuration

The default configuration of the lexer and the degenerator has also been changed.

The degenerator now uses multibyte operations by default. This needs PHP's mbstring module. If you don't have it, set multibyte to false in the config array.

Speaking of the lexer, the legacy HTML extractor has been removed, along with its old_get_html config option.

Please update your configuration arrays!

Have a lot of fun with b8 :-)


Calculating the difference between two QDates

I wanted to calculate the difference between two QDates. Not only the days, but also the years, months, and weeks (for use in KPhotoAlbum).

I ended up using the following algorithm:

[Update 2024-02-07]: In some cases, the days weren't calculated correctly, because a day exceeding the number of days in that specific month was used, resulting in an invalid QDate. This is now fixed.

[Update 2024-03-05]: Now, we also calculate timespans correctly if the date we refer to is a February 29.

struct DateDifference
{
    int years;
    int months;
    int days;

    bool operator==(const DateDifference &other) const
    {
        return    this->years  == other.years
               && this->months == other.months
               && this->days   == other.days;
    }

    bool operator!=(const DateDifference &other) const
    {
        return    this->years  != other.years
               || this->months != other.months
               || this->days   != other.days;
    }
};

DateDifference dateDifference(const QDate &date, const QDate &reference)
{
    if (date > reference) {
        return dateDifference(reference, date);
    }

    int dateDay = date.day();
    if (date.month() == 2 && dateDay == 29
        && ! QDate::isLeapYear(reference.year())) {
        // If we calculate the timespan from a February 29 to a date in a non-leap year, we use
        // February 28 instead (the last day in February). This also makes sure that birthdays
        // of people born on a February 29 are calculated correctly for non-leap years
        // (February 28, the last day in February)
        dateDay = 28;
    }

    int years = reference.year() - date.year();
    int months = reference.month() - date.month();
    if (reference.month() < date.month()
        || ((reference.month() == date.month()) && (reference.day() < dateDay))) {
        years--;
        months += 12;
    }
    if (reference.day() < dateDay) {
        months--;
    }

    int remainderMonth = reference.month() - (reference.day() < dateDay);
    int remainderYear = reference.year();
    if (remainderMonth == 0) {
        remainderMonth = 12;
        remainderYear--;
    }

    const auto daysOfRemainderMonth = QDate(remainderYear, remainderMonth, 1).daysInMonth();
    const auto remainderDay = dateDay > daysOfRemainderMonth ? daysOfRemainderMonth : dateDay;

    int days = QDate(remainderYear, remainderMonth, remainderDay).daysTo(reference);

    return { years, months, days };
}
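
To try the steps above without Qt, here is a minimal sketch of the same calculation in plain C++. The names Date, Difference, isLeapYear(), daysInMonth() and difference() are made up for this illustration and are not Qt API; the day count that QDate::daysTo() delivers above is replaced by simple arithmetic, assuming both dates are valid Gregorian dates.

```cpp
#include <cassert>
#include <tuple>

// Plain C++ stand-ins for the QDate calls used above (hypothetical, not Qt API)
struct Date { int year; int month; int day; };
struct Difference { int years; int months; int days; };

bool isLeapYear(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

int daysInMonth(int year, int month)
{
    static const int days[] = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    return (month == 2 && isLeapYear(year)) ? 29 : days[month - 1];
}

Difference difference(const Date &date, const Date &reference)
{
    assert(date.month >= 1 && date.month <= 12);
    assert(reference.month >= 1 && reference.month <= 12);

    // Always calculate from the earlier date to the later one
    if (std::tie(date.year, date.month, date.day)
        > std::tie(reference.year, reference.month, reference.day)) {
        return difference(reference, date);
    }

    // A February 29 counts as February 28 if the reference year is no leap year
    int dateDay = date.day;
    if (date.month == 2 && dateDay == 29 && !isLeapYear(reference.year)) {
        dateDay = 28;
    }

    int years = reference.year - date.year;
    int months = reference.month - date.month;
    if (reference.month < date.month
        || (reference.month == date.month && reference.day < dateDay)) {
        years--;
        months += 12;
    }
    if (reference.day < dateDay) {
        months--;
    }

    // The month the remaining days are counted from
    int remainderMonth = reference.month - (reference.day < dateDay ? 1 : 0);
    int remainderYear = reference.year;
    if (remainderMonth == 0) {
        remainderMonth = 12;
        remainderYear--;
    }

    // Clamp the day, so that e.g. a "February 31" can't occur
    const int daysOfRemainderMonth = daysInMonth(remainderYear, remainderMonth);
    const int remainderDay = dateDay > daysOfRemainderMonth ? daysOfRemainderMonth : dateDay;

    // The remainder date lies either in the reference's month or in the month before it
    const int days = remainderMonth == reference.month
        ? reference.day - remainderDay
        : daysOfRemainderMonth - remainderDay + reference.day;

    return { years, months, days };
}
```

With this sketch, 2000-02-29 to 2023-02-28 yields 23 years, 0 months and 0 days (the February 29 case), and 2024-01-31 to 2024-03-01 yields 0 years, 1 month and 1 day (the clamping case, counted from February 29).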

Perhaps this will help somebody.

I also filed a feature request about adding such a function to Qt directly. Hopefully, it will be added in Qt 5.14 (to the new QCalendar class) :-)


iBlue 747A+ delivers wrong date

Back from our vacation in the Bavarian Forest, I noticed that my long-serving and really reliable iBlue 747A+ GPS logger apparently stopped working: it delivered tracks with a date around year 2000. Investigating this further, I found that the time appeared to be okay, but the date was completely wrong.

Fortunately, the device isn't broken. It was just hit by the GPS Week Number Rollover, which took place on 2019-04-07. This really sucks! But blessedly, I can continue using the logger. One can fix the wrong date by adding 172,032 hours to it (that is: 1024 weeks times 7 days times 24 hours).

This can be done via gpsbabel in the following way (assuming we have a GPX encoded GPS data file):

gpsbabel -t -i gpx -f original.gpx -x track,move=+172032h -o gpx -F fixed.gpx
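
The offset used in the command above is plain arithmetic: the GPS week counter has 10 bits, so it wraps around after 1024 weeks. Here is a small sketch of that calculation in C++. The function names rolloverOffset() and fixRolledOverTimestamp() are made up for this illustration, and the sketch assumes the wrong dates are available as Unix timestamps.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// One GPS week number rollover: 1024 weeks, expressed in hours
constexpr std::chrono::hours rolloverOffset()
{
    return std::chrono::hours(1024 * 7 * 24);
}

// The same value gpsbabel gets passed above as "move=+172032h"
static_assert(rolloverOffset().count() == 172032, "1024 weeks are 172,032 hours");

// Hypothetical helper: shifts a Unix timestamp that was logged with the wrong date
// forward by exactly one rollover
std::time_t fixRolledOverTimestamp(std::time_t logged)
{
    // The logger's wrong dates are around the year 2000, so they are positive Unix times
    assert(logged >= 0);
    const auto seconds = std::chrono::duration_cast<std::chrono::seconds>(rolloverOffset());
    return logged + static_cast<std::time_t>(seconds.count());
}
```

In seconds, the offset is 172,032 times 3,600, that is 619,315,200.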

Maybe this will help somebody.

Hopefully, Europe's Galileo GNSS won't suffer from such shortcomings and my next "GPS" logger will use this one … and the good ol' one will continue working for a few years, until Galileo is finished and working, and a lot of devices support it ;-)


Levenshtein Distance and Longest Common Subsequence in Qt

In case anybody needs to calculate the Levenshtein Distance or the Longest Common Subsequence of two QStrings, here's some code I wrote/found/adapted after quite some investigation.

This Levenshtein Distance function seems to be quite nicely optimized to me (an amateur programmer), as it does not calculate the whole comparison matrix, but only keeps the previous and the current column:

int levenshteinDistance(const QString &source, const QString &target)
{
    // Mostly stolen from https://qgis.org/api/2.14/qgsstringutils_8cpp_source.html

    if (source == target) {
        return 0;
    }

    const int sourceCount = source.count();
    const int targetCount = target.count();

    if (source.isEmpty()) {
        return targetCount;
    }

    if (target.isEmpty()) {
        return sourceCount;
    }

    if (sourceCount > targetCount) {
        return levenshteinDistance(target, source);
    }

    QVector<int> column;
    column.fill(0, targetCount + 1);
    QVector<int> previousColumn;
    previousColumn.reserve(targetCount + 1);
    for (int i = 0; i < targetCount + 1; i++) {
        previousColumn.append(i);
    }

    for (int i = 0; i < sourceCount; i++) {
        column[0] = i + 1;
        for (int j = 0; j < targetCount; j++) {
            column[j + 1] = std::min({
                1 + column.at(j),
                1 + previousColumn.at(1 + j),
                previousColumn.at(j) + ((source.at(i) == target.at(j)) ? 0 : 1)
            });
        }
        column.swap(previousColumn);
    }

    return previousColumn.at(targetCount);
}
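
For trying out the two-column approach without Qt, here is a minimal sketch using std::string and std::vector. It compares bytes, not Unicode characters, so it is only a stand-in for the QString version above; the function name levenshtein() is made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

int levenshtein(const std::string &source, const std::string &target)
{
    // previous holds the distances for the part of source handled so far. We start with
    // the empty prefix, which needs 0, 1, 2, ... insertions to become a prefix of target
    std::vector<int> previous(target.size() + 1);
    std::iota(previous.begin(), previous.end(), 0);
    std::vector<int> current(target.size() + 1, 0);

    for (std::size_t i = 0; i < source.size(); i++) {
        current[0] = static_cast<int>(i) + 1;
        for (std::size_t j = 0; j < target.size(); j++) {
            current[j + 1] = std::min({
                current[j] + 1,                                // insertion
                previous[j + 1] + 1,                           // deletion
                previous[j] + (source[i] == target[j] ? 0 : 1) // match or substitution
            });
        }
        // The column just calculated becomes the previous one for the next round
        std::swap(current, previous);
    }

    return previous[target.size()];
}
```

For example, levenshtein("kitten", "sitting") returns 3: two substitutions and one insertion.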

And here's the one for the Longest Common Subsequence, to be used for diff-like comparing of two QStrings and generation of a visible representation of their differences:

QString longestCommonSubsequence(const QString &source, const QString &target)
{
    // Mostly stolen from https://www.geeksforgeeks.org/printing-longest-common-subsequence/

    QMap<int, QMap<int, int>> l;
    for (int i = 0; i <= source.count(); i++) {
        for (int j = 0; j <= target.count(); j++) {
            if (i == 0 || j == 0) {
                l[i][j] = 0;
            } else if (source.at(i - 1) == target.at(j - 1)) {
                l[i][j] = l[i - 1][j - 1] + 1;
            } else {
                l[i][j] = std::max(l[i - 1][j], l[i][j - 1]);
            }
        }
    }

    int i = source.count();
    int j = target.count();
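    int index = l[source.count()][target.count()];
    // Pre-size the result, so that writing to [index - 1] below does not rely on the
    // string growing implicitly (Qt 6 does not do this anymore)
    QString longestCommonSubsequence(index, QChar());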
    while (i > 0 && j > 0) {
        if (source.at(i - 1) == target.at(j - 1)) {
            longestCommonSubsequence[index - 1] = source.at(i - 1);
            i--;
            j--;
            index--;
        } else if (l[i - 1][j] > l[i][j - 1]) {
            i--;
        } else {
            j--;
        }
    }

    return longestCommonSubsequence;
}
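
Here is a sketch of how the result could be used for a diff-like representation, again in plain C++ with std::string (bytewise, not Unicode-aware). It uses a std::vector table instead of nested QMaps, and the helper markDifferences() is made up for this illustration: it simply wraps every character that is not part of the common subsequence in square brackets.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::string longestCommonSubsequence(const std::string &source, const std::string &target)
{
    const std::size_t rows = source.size() + 1;
    const std::size_t columns = target.size() + 1;

    // table[i][j] is the LCS length of the first i chars of source and the first j of target
    std::vector<std::vector<int>> table(rows, std::vector<int>(columns, 0));
    for (std::size_t i = 1; i < rows; i++) {
        for (std::size_t j = 1; j < columns; j++) {
            table[i][j] = source[i - 1] == target[j - 1]
                ? table[i - 1][j - 1] + 1
                : std::max(table[i - 1][j], table[i][j - 1]);
        }
    }

    // Walk back from the bottom right corner and fill the result from its end
    std::size_t i = source.size();
    std::size_t j = target.size();
    std::size_t index = static_cast<std::size_t>(table[i][j]);
    std::string result(index, ' ');
    while (i > 0 && j > 0) {
        if (source[i - 1] == target[j - 1]) {
            result[--index] = source[i - 1];
            i--;
            j--;
        } else if (table[i - 1][j] > table[i][j - 1]) {
            i--;
        } else {
            j--;
        }
    }

    return result;
}

// Hypothetical helper: wraps each char of text that is not part of common in [ ].
// Expects common to be a subsequence of text
std::string markDifferences(const std::string &text, const std::string &common)
{
    std::string marked;
    std::size_t next = 0; // Position of the next char of common still to be matched
    for (const char c : text) {
        if (next < common.size() && c == common[next]) {
            marked += c;
            next++;
        } else {
            marked += '[';
            marked += c;
            marked += ']';
        }
    }
    return marked;
}
```

For "AGGTAB" and "GXTXAYB", the common subsequence is "GTAB", and marking both strings against it gives "[A]G[G]TAB" and "G[X]T[X]A[Y]B".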

Just in case somebody needs this.