Smith-Waterman Distance for feature extraction in NLP

3 minute read

I recently competed in a http://hackerrank.com competition. The task was to classify text with multiple labels, so I started with a basic bag-of-words approach, which performed quite well. After analyzing the data a bit, I realized that some keywords showed up in slightly different representations, which is unfavorable for bag of words. E.g. the keyword years of experience consists of 3 words, and a bag-of-words approach can't handle that properly because BOW doesn't know that these 3 words belong together. You can use n-grams to compensate for that a bit, but in my experience this fails most of the time. Also, the phrase can appear without the word of, or with other words in between. Often the words contain small typos or extra characters like - or 's, which would e.g. affect the matching of regular expressions.
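To make this concrete, here is a small sketch of how such variations look to a BOW / n-gram model; it assumes scikit-learn's CountVectorizer and a few made-up example sentences:

from sklearn.feature_extraction.text import CountVectorizer

# made-up job description snippets with different surface forms of the same keyword
docs = [
    '5+ years of experience required',
    'five years experience in Python',
    "3 years' exp. with similar systems",
]

# unigram BOW: 'years', 'of' and 'experience' become independent, unrelated features
bow = CountVectorizer().fit(docs)
print(sorted(bow.vocabulary_))

# bigrams only help if the exact word pair appears, so 'years of' and
# 'years experience' end up as two completely different features
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
print(sorted(bigrams.vocabulary_))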

After a bit of classifier tuning, a technique called Smith-Waterman local alignment came to my mind; it is normally a bioinformatics algorithm for comparing DNA sequences. I couldn't find a Python implementation, so I wrote a really simple version of the metric myself. For the details, please have a look at https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm, which explains it quite well.

import numpy as np

def smith_waterman_distance(seq1, seq2, match=3, mismatch=-1, insertion=-1, deletion=-1, normalize=1):
    '''simple and general smith waterman distance for NLP feature extraction'''
    # swap the sequences so that seq1 is always the longer one in which we search for seq2
    if len(seq2) > len(seq1): seq1, seq2 = seq2, seq1
    # create the distance matrix
    mat = np.zeros((len(seq2) + 1, len(seq1) + 1))
    # iterate over the rows of the matrix (characters of seq2)
    for i in range(1, mat.shape[0]):
        # iterate over the columns of the matrix (characters of seq1)
        for j in range(1, mat.shape[1]):
            # set the current matrix element to the maximum of 4 different cases
            mat[i, j] = max(
                # negative values are not allowed
                0,
                # if the current characters match, increase the score by match, else decrease it by mismatch
                mat[i - 1, j - 1] + (match if seq1[j - 1] == seq2[i - 1] else mismatch),
                # a character of seq2 is not found at this position in seq1, so decrease the score by deletion
                mat[i - 1, j] + deletion,
                # an additional character sits in seq1 between the matches, so decrease the score by insertion
                mat[i, j - 1] + insertion
            )
    # the maximum of mat is now the score, which is returned raw or normalized (with a range of 0-1)
    return np.max(mat) / (len(seq2) * match) if normalize else np.max(mat)

The next code block shows two calls of the function that illustrate the metric in the context of feature extraction for NLP classification.

seq1 = 'You need a driver\'s-license for doin the job'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
#outputs 0.844444444444

seq1 = 'Some random nonesense Sentence of a job description'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
#outputs 0.266666666667

The advantage of the SWD is that you can punish different faults in the string differently. E.g. if you want to accept strings that have other characters in between, you can just set the insertion parameter to 0 or to a small negative value. The same is valid for the deletion parameter, which punishes incompleteness.
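To illustrate this with the function from above (the phrase and the penalty values are made up for this example):

seq1 = 'at least five years of professional experience'
seq2 = 'years experience'

# every character of 'of professional ' between the two keywords is punished,
# so the score stays well below 1
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))

# insertion set to 0: characters in between are free, so the whole query
# 'years ... experience' is found and the score rises to the maximum of 1.0
print(smith_waterman_distance(seq1, seq2, 3, -2, 0, -2, 1))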

After calculating the SWD of some handcrafted terms for all samples, you get a neat and dense feature vector. In the competition, I stacked it on top of the BOW feature vector, which gave me a performance boost of approximately 8% in F1 score.
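A minimal sketch of what this stacking could look like, using the function from above, scikit-learn's CountVectorizer for the BOW part and scipy for the sparse stacking (the terms and samples are made up):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# handcrafted key phrases and some example samples
terms = ['drivers license', 'years of experience']
samples = [
    "You need a driver's-license for doin the job",
    'At least five years of experience with Python',
]

# one SWD score per (sample, term) pair -> dense matrix of shape (n_samples, n_terms)
swd_features = np.array([[smith_waterman_distance(s, t, 3, -2, -2, -2, 1) for t in terms]
                         for s in samples])

# plain BOW features as a sparse matrix
bow_features = CountVectorizer().fit_transform(samples)

# stack both feature blocks into one matrix for the classifier
X = hstack([bow_features, csr_matrix(swd_features)])
print(X.shape)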
