# ML Algorithms: Levenshtein

## Introduction

The purpose of the Levenshtein algorithm is to provide a measure of similarity between words.
This measurement of *edit distance* is determined by calculating the number of additions, subtractions,
and substitutions of letters needed for a word to transform to another. If we wanted to determine the edit distance
between the word `rotten`

and the word `rotting`

we could first perform the following operations:

rotten -> rottin (substitute e for i) rottin -> rotting (add g)

If we assign costs to the different operations: addition, subtraction, and substitution, we can then calculate a score for this transformation.

Let say we assign the following cost values:

Addition: 1 Subtraction: 1 Substitution: 1

The edit distance between `rotten`

and `rotting`

would be 2.

As there could be many possible operations to transform one word to another, Levenshtein specifically calculates the lowest edit distance possible.

Our example becomes clearer when we show the words in a matrix. By following along the diagonal we see that we stay at 0 until the `e`

needs to be
substituted for `i`

and then at the end we shift to the right to add our `g`

, this lower right value being our lowest possible score, tells us
our Levenshtein edit distance.

r o t t i n g 0 1 2 3 4 5 6 7 r 1 0.0 1.0 2.0 3.0 4.0 5.0 6.0 o 2 1.0 0.0 1.0 2.0 3.0 4.0 5.0 t 3 2.0 1.0 0.0 1.0 2.0 3.0 4.0 t 4 3.0 2.0 1.0 0.0 1.0 2.0 3.0 e 5 4.0 3.0 2.0 1.0 1.0 2.0 3.0 n 6 5.0 4.0 3.0 2.0 2.0 1.0 2.0

## Implementation

The Python implementation of Levenshtein can be seen below:

#!/usr/bin/python2.7 # -*- coding: utf-8 -*- # This levenshtein function takes two strings and can additionally # take parameters for insert, delete and substitution cost. # returns a list containing the edit distance, normalized edit distance and the matrix built def levenshtein(source, targ, insert_cost = 1.0, delete_cost = 1.0, sub_cost = 1.0): l1 = len(source) l2 = len(targ) if l1 < l2: return levenshtein(targ, source, insert_cost, delete_cost, sub_cost) matrix = [range(l1 + 1)] * (l2 + 1) for x in range(l2 + 1): matrix[x] = range(x,x + l1 + 1) for j in range(0,l2): for i in range(0,l1): if source[i] == targ[j]: substitution_cost = 0.0 else: substitution_cost = sub_cost matrix[j+1][i+1] = min(matrix[j+1][i] + insert_cost, matrix[j][i+1] + delete_cost, matrix[j][i] + substitution_cost) # Return our edit distance (last row, last column), # the normailize edit distance (divded by sum of source and target lengths), # and additionally lets return the matrix we built so that we can # pass it to display functions if we want. return [matrix[l2][l1], matrix[l2][l1] / (l1 + l2), matrix] if __name__ == '__main__': results = levenshtein('rotten', 'rotting') print "Edit Distance for 'rotting' and 'rotten': " + str(results[0]) print "Normalized Edit Distance for 'rotting' and 'rotten': " + str(results[1]) + "\n"

As we can see from our example matrix, our Levenshtein function is building a similar matrix and then working its way through the letters of the two words determining the minimum cost among the three operations for the two letters in question as it fills out the matrix.

For a more detailed example implementation with functions to produce nice color coded matrix outputs see the following link.

Disclaimer: This example, while a working example, was not created for speed, but rather to adequately explain the algorithm in question. You can find speedy implementations (usually in C) available for most languages with a quick Google search.

## Use Cases

One of the common use cases for using Levenshtein would be spell checking. One could imagine using a word dictionary and finding Levenshtein edit distances against a particular word to produce possible corrections. Variants of this could be used for other fuzzy string matching needs, for example to help in translation tasks.

As stated in the beginning of this post, Levenshtein edit distance is a measure of similarity between words and this can be helpful in itself as we progress in our Machine Learning Algorithm series, we will see how this score could potentially be a feature for a classifier or used in some other means as a metric in Machine Learning/NLP tasks.