After a bit of classifier tuning, a technique called Smith-Waterman local alignment came to mind. It is normally a bioinformatics algorithm for comparing DNA sequences. I couldn't find a Python implementation, so I wrote a really simple version of the metric myself. For background, see https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm, which explains it quite well.

```python
import numpy as np


def smith_waterman_distance(seq1, seq2, match=3, mismatch=-1,
                            insertion=-1, deletion=-1, normalize=1):
    '''Simple and general Smith-Waterman distance for NLP feature extraction.'''
    # swap the sequences, so that seq1 is the longer sequence to search for seq2
    if len(seq2) > len(seq1):
        seq1, seq2 = seq2, seq1
    # create the scoring matrix
    mat = np.zeros((len(seq2) + 1, len(seq1) + 1))
    # iterate over the matrix row-wise
    for i in range(1, mat.shape[0]):
        # iterate over the matrix column-wise
        for j in range(1, mat.shape[1]):
            # set the current matrix element to the maximum of 4 different cases
            mat[i, j] = max(
                # negative values are not allowed
                0,
                # if the characters match, increase the score by match,
                # otherwise decrease it by mismatch
                mat[i - 1, j - 1] + (match if seq1[j - 1] == seq2[i - 1] else mismatch),
                # one character is missing in seq2, so decrease the score by deletion
                mat[i - 1, j] + deletion,
                # one additional character is in seq2, so decrease the score by insertion
                mat[i, j - 1] + insertion
            )
    # the maximum of mat is the score, returned raw or normalized to the range 0-1
    return np.max(mat) / (len(seq2) * match) if normalize else np.max(mat)
```

The next code block shows two outputs of the function, which illustrate the metric in the context of feature extraction for NLP classification.

```python
seq1 = 'You need a driver\'s-license for doin the job'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
# outputs 0.844444444444

seq1 = 'Some random nonesense Sentence of a job description'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
# outputs 0.266666666667
```

The advantage of the SWD is that you can penalize different kinds of faults in the string differently. For example, if you want to accept strings that may have other characters in between, you can simply set the insertion parameter to 0 or to a small negative value. The same holds for the deletion parameter, which penalizes incompleteness.
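A small sketch of this effect, using a made-up pair of sequences where the search term 'abc' appears in 'axbxc' with extra characters in between (the function is the one defined above, repeated here so the snippet runs on its own):

```python
import numpy as np


# smith_waterman_distance as defined above, repeated so this snippet is standalone
def smith_waterman_distance(seq1, seq2, match=3, mismatch=-1,
                            insertion=-1, deletion=-1, normalize=1):
    if len(seq2) > len(seq1):
        seq1, seq2 = seq2, seq1
    mat = np.zeros((len(seq2) + 1, len(seq1) + 1))
    for i in range(1, mat.shape[0]):
        for j in range(1, mat.shape[1]):
            mat[i, j] = max(
                0,
                mat[i - 1, j - 1] + (match if seq1[j - 1] == seq2[i - 1] else mismatch),
                mat[i - 1, j] + deletion,
                mat[i, j - 1] + insertion)
    return np.max(mat) / (len(seq2) * match) if normalize else np.max(mat)


# 'abc' occurs in 'axbxc' with extra characters in between;
# a zero insertion penalty treats it as a perfect local match
print(smith_waterman_distance('axbxc', 'abc', 3, -2, 0, -2, 1))   # 1.0

# with a negative insertion penalty, the interspersed characters reduce the score
print(smith_waterman_distance('axbxc', 'abc', 3, -2, -2, -2, 1))  # ~0.556
```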

After calculating the SWD of some handcrafted terms against all samples, you end up with a neat and dense feature vector. In the competition, I stacked it on top of the BOW feature vector, which gave me a performance boost of approximately 8% in F1 score.
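The stacking step could be sketched like this. The terms, samples, and the random stand-in for the BOW matrix are all made up for illustration; in practice the BOW features would come from a real vectorizer, and the function is the smith_waterman_distance defined above, repeated so the snippet runs standalone:

```python
import numpy as np


# smith_waterman_distance as defined above, repeated so this sketch is standalone
def smith_waterman_distance(seq1, seq2, match=3, mismatch=-1,
                            insertion=-1, deletion=-1, normalize=1):
    if len(seq2) > len(seq1):
        seq1, seq2 = seq2, seq1
    mat = np.zeros((len(seq2) + 1, len(seq1) + 1))
    for i in range(1, mat.shape[0]):
        for j in range(1, mat.shape[1]):
            mat[i, j] = max(
                0,
                mat[i - 1, j - 1] + (match if seq1[j - 1] == seq2[i - 1] else mismatch),
                mat[i - 1, j] + deletion,
                mat[i, j - 1] + insertion)
    return np.max(mat) / (len(seq2) * match) if normalize else np.max(mat)


# hypothetical handcrafted terms and samples, just to show the feature shapes
terms = ['drivers license', 'forklift']
samples = ["You need a driver's license for the job",
           'Some random sentence of a job description']

# one dense SWD feature per (sample, term) pair, each in the range 0-1
swd_features = np.array([[smith_waterman_distance(s, t, 3, -2, -2, -2, 1)
                          for t in terms] for s in samples])

# stand-in for a real BOW matrix (e.g. from a count vectorizer)
bow_features = np.random.randint(0, 3, size=(len(samples), 10))

# stack the dense SWD columns onto the BOW features
X = np.hstack([bow_features, swd_features])
print(X.shape)  # (2, 12)
```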
