nasauber.de

Blog

Levenshtein Distance and Longest Common Subsequence in Qt

In case anybody needs to calculate the Levenshtein Distance or the Longest Common Subsequence of two QStrings, here's some code I wrote/found/adapted after quite some investigation.

This Levenshtein Distance function seems to be quite nicely optimized to me (an amateur programmer), as it does not calculate the whole comparison matrix, but only keeps the last column:

int levenshteinDistance(const QString &source, const QString &target)
{
    // Mostly stolen from https://qgis.org/api/2.14/qgsstringutils_8cpp_source.html

    if (source == target) {
        return 0;
    }

    const int sourceCount = source.count();
    const int targetCount = target.count();

    if (source.isEmpty()) {
        return targetCount;
    }

    if (target.isEmpty()) {
        return sourceCount;
    }

    if (sourceCount > targetCount) {
        return levenshteinDistance(target, source);
    }

    QVector<int> column;
    column.fill(0, targetCount + 1);
    QVector<int> previousColumn;
    previousColumn.reserve(targetCount + 1);
    for (int i = 0; i < targetCount + 1; i++) {
        previousColumn.append(i);
    }

    for (int i = 0; i < sourceCount; i++) {
        column[0] = i + 1;
        for (int j = 0; j < targetCount; j++) {
            column[j + 1] = std::min({
                1 + column.at(j),
                1 + previousColumn.at(1 + j),
                previousColumn.at(j) + ((source.at(i) == target.at(j)) ? 0 : 1)
            });
        }
        column.swap(previousColumn);
    }

    return previousColumn.at(targetCount);
}

And here's the one for the Longest Common Subsequence, to be used for diff-like comparing of two QStrings and generation of a visible representation of their differences:

QString longestCommonSubsequence(const QString &source, const QString &target)
{
    // Mostly stolen from https://www.geeksforgeeks.org/printing-longest-common-subsequence/

    QMap<int, QMap<int, int>> l;
    for (int i = 0; i <= source.count(); i++) {
        for (int j = 0; j <= target.count(); j++) {
            if (i == 0 || j == 0) {
                l[i][j] = 0;
            } else if (source.at(i - 1) == target.at(j - 1)) {
                l[i][j] = l[i - 1][j - 1] + 1;
            } else {
                l[i][j] = std::max(l[i - 1][j], l[i][j - 1]);
            }
        }
    }

    int i = source.count();
    int j = target.count();
    int index = l[source.count()][target.count()];
    QString longestCommonSubsequence;
    while (i > 0 && j > 0) {
        if (source.at(i - 1) == target.at(j - 1)) {
            longestCommonSubsequence[index - 1] = source.at(i - 1);
            i--;
            j--;
            index--;
        } else if (l[i - 1][j] > l[i][j - 1]) {
            i--;
        } else {
            j--;
        }
    }

    return longestCommonSubsequence;
}

Just in case somebody needs this.