Smith Waterman Distance for feature extraction in NLP

I recently competed in a competition. The task was to classify text with multi-labels. Therefore, I started with a basic bag of words approach, which performed quite good. After analyzing the data a bit, I realized that some keywords came up in slightly different representation - which for bag of words is a bit unfavorable. E.g. the keyword years of experience consist of 3 words which aren't handled with a bag of words approach because BOW don't know that these 3 words belong together. You can use ngrams to compensate that a bit, but my experience showed, that this fails most of the times. Also, this example can also appear without the word of or it can appear with other words in between. Often the words have small typos in them or other signes like - or 's, which would e.g. affect the matching of regular expressions.

After a bit classifier tuning a technique called Smith Waterman local alignment, which is normally a bio-informatics algorithm for comparing DNA sequences, came into my mind. I couldn't find a python implementation, so I wrote it myself a really simple version of the metric. For understanding please consider which explains it quite good.

import numpy as np

def smith_waterman_distance(seq1, seq2, match=3, mismatch=-1, insertion=-1, deletion=-1, normalize=1):
    '''simple and general smith waterman distance for NLP feature extraction'''
    # switch sequences, so that seq1 is the longer sequence to search for seq2
    if len(seq2) > len(seq1): seq1, seq2 = seq2, seq1
    # create the distance matrix
    mat = np.zeros((len(seq2) + 1, len(seq1) + 1))
    # iterate over the matrix column wise
    for i in range(1, mat.shape[0]):
        # iterate over the matrix row wise
        for j in range(1, mat.shape[1]):
            # set the current matrix element with the maximum of 4 different cases
            mat[i, j] = max(
                # negative values are not allowed
                # if previous character matches increase the score by match, else decrease it by mismatch
                mat[i - 1, j - 1] + (match if seq1[j - 1] == seq2[i - 1] else mismatch),
                # one character is missing in seq2, so decrease the score by deletion
                mat[i - 1, j] + deletion,
                # one additional character is in seq2, so decrease the scare by insertion
                mat[i, j - 1] + insertion
    # the maximum of mat is now the score, which is returned raw or normalized (with a range of 0-1)
    return np.max(mat) / (len(seq2) * match) if normalize else np.max(mat)

In the next code block you can see two outputs of the function which describe the metric in the context of feature extraction for NLP classification.

seq1 = 'You need a driver\'s-license for doin the job'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
#outputs 0.844444444444

seq1 = 'Some random nonesense Sentence of a job description'
seq2 = 'drivers license'
print(smith_waterman_distance(seq1, seq2, 3, -2, -2, -2, 1))
#outputs 0.266666666667

The advantage of the SWD is, that you can punish faults in the string differently. E.g. If you want accept strings, that can have other characters in between, you can just set the insertion parameter to 0 or a small negative value. The same is valid for the deletion parameter which punishes incompleteness.

After calculating the SWD for some handcrafted terms for all samples one have a neat and dense feature vector. In the competition, I stacked it on top of the BOW feature vector, which gives me a performance boost of approximately 8% in f1-score.

OpenSCAD implementation of bezier curves of any degrees

Recently, I saw a Youtube video where someone implemented bezier curves for 3 and 4 degrees, and I thought it would be nice if there would be a version with any degrees. I didn’t find something like that, so I implemented it myself. First, you need a definition of a line (which is surprisingly not a standard datatype in OpenSCAD):

module line(p1,p2,w) {
    hull() {
        translate(p1) circle(r=w);
        translate(p2) circle(r=w);
module polyline(points, index, w) {
    if(index < len(points)) {
        line(points[index - 1], points[index],w);
        polyline(points, index + 1, w);

For creating a bezier curve I had to define a few functions. I’m wondering why OpenSCAD has such a weak function definition, where you basically only can define one statement and not a whole block like in almost every other language. However, I wrote a recursive implementation of sampling one point at the bezier curve and call this function several times to build an array of points.

function choose(n, k)=
     k == 0? 1
    : (n * choose(n - 1, k - 1)) / k;

function _point_on_bezier_rec(points,t,i,c)=
    len(points) == i ? c
    : _point_on_bezier_rec(points,t,i+1,c+choose(len(points)-1,i) * pow(t,i) * pow(1-t,len(points)-i-1) * points[i]);

function _point_on_bezier(points,t)=

//a bezier curve with any number of control points
//points - the control points of the bezier curve (number of points is variable)
//resolution - the sampling resolution of the bezier curve (number of returned points)
//resolution number of samples on the bezier curve
function bezier(points,resolution)=[
for (t =[0:1.0/resolution:1+1.0/(resolution/2)]) _point_on_bezier(points,t)

The first parameter is just an array of 2D-points, defining the control points. Note, that you can define the resolution of the sampling via the second parameter.

In the following example, I created a simple vase with 6 control points. The whole resolution can be defined as always with the $fn global variable. I recommend using a small value while debugging because it is significantly faster. I just set the number of control points of the bezier curve to the same value. With the variables radius, height and strength, I defined the size of the vase. The control points p0 to p5 are relative to these variables to allow a variable size. I used a cylinder for the base plate because it is difficult to define a really flat bezier curve, which can be printed nicely. Or one could use boolean combination to cut the base flat. However, by calling the bezier function, create a polyline with the result points and extrude this with a rotate_extrude, one gets a pretty nice result.

resolution = 100;    
$fn = resolution;

radius = 20;
height = 90;
strength = 1;

p0 = [radius,0];
p1 = [radius*3,height*0.2];
p2 = [-radius,height*0.4];
p3 = [radius*3,height*0.7];
p4 = [0,height*0.8];
p5 = [radius*0.8,height*1];

translate([0,0,-strength]) cylinder(r=radius+strength,h=strength*2);

This could be also combined into one function:

//create a 3D rotational model with a bezier curve of given points, resolution and thickness
module bezier_model(points,resolution,thickness) {
    translate([0,0,thickness/2]) rotate_extrude() polyline(bezier(points,resolution),1,thickness/2);

Result image: