Smith-Waterman Distance for feature extraction in NLP

I recently competed in a http://hackerrank.com competition. The task was multi-label text classification. I started with a basic bag-of-words approach, which performed quite well. After analyzing the data a bit, I realized that some keywords came up in slightly different forms, which is a problem for bag of words. For example, the keyword "years of experience" consists of three words, and bag of words does not treat it as a unit because it doesn't know that these three words belong together. You can use n-grams to compensate for that a bit, but in my experience this fails most of the time. The same keyword can also appear without the word "of", or with other words in between. Often the words contain small typos or extra characters such as - or 's, which would, for example, break matching with regular expressions.
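
To make the problem concrete, here is a minimal sketch of exact n-gram matching against a few variants of the keyword. The sample phrases and the helper names (`word_ngrams`, `contains_keyword`) are made up for illustration and are not taken from the competition data or from my actual pipeline.

```python
import re

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

KEYWORD = "years of experience"

# Hypothetical sample phrases (assumption: invented for this example).
samples = [
    "5 years of experience in Java",       # exact keyword
    "5 years experience in Java",          # "of" is missing
    "5 years of professional experience",  # extra word in between
    "5 year's of experiance in Java",      # 's and a typo
]

def contains_keyword(text, keyword=KEYWORD):
    """Check whether the keyword occurs as an exact word n-gram in text."""
    n = len(keyword.split())
    return keyword in word_ngrams(text, n)

matches = [contains_keyword(s) for s in samples]
for sample, matched in zip(samples, matches):
    print(f"{matched!s:5}  {sample}")
```

Only the first phrase is found. The other three carry the same keyword for a human reader, but an exact trigram lookup misses all of them.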

