Smith-Waterman Distance for feature extraction in NLP

I recently competed in a http://hackerrank.com competition. The task was multi-label text classification. I started with a basic bag-of-words approach, which performed quite well. After analyzing the data a bit, I realized that some keywords came up in slightly different representations, which is unfavorable for bag of words. For example, the keyword years of experience consists of three words, and a bag-of-words model does not know that these three words belong together. You can use n-grams to compensate for that a bit, but in my experience this fails most of the time. The same keyword can also appear without the word of, or with other words in between. Often the words contain small typos or extra characters such as - or 's, which would, for example, break matching with regular expressions.
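To make the idea from the title concrete, here is a minimal sketch of a character-level Smith-Waterman local alignment score used as a keyword feature. The function names, the scoring parameters (match=2, mismatch=-1, gap=-1), and the normalization by the best possible score are my assumptions for illustration, not necessarily the setup used in the competition.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Diagonal move: align a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def keyword_feature(keyword, text, match=2, mismatch=-1, gap=-1):
    """Hypothetical feature: alignment score scaled to the range [0, 1].

    1.0 means the keyword occurs verbatim (ignoring case) somewhere in text.
    """
    if not keyword:
        return 0.0
    max_score = match * len(keyword)
    score = smith_waterman_score(keyword.lower(), text.lower(),
                                 match, mismatch, gap)
    return score / max_score


if __name__ == "__main__":
    keyword = "years of experience"
    samples = [
        "5 years of experience required",
        "3+ years experience in Java",
        "yeers of experiance preferred",
        "we like python",
    ]
    for sample in samples:
        print(f"{keyword_feature(keyword, sample):.2f}  {sample}")
```

Unlike an exact n-gram or regular-expression match, the score degrades gradually: a missing word or a small typo lowers the feature value a little instead of dropping it to zero. One such feature per keyword can then be added next to the bag-of-words features.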
