Login


Approximate String Comparisons Using Levenshtein Distance

By Jonathan Wood on 2/27/2011
Language: C#
Technology: .NET
Platform: Windows
License: CPOL
Views: 19,183
General Programming » Algorithms » Text Algorithms » Approximate String Comparisons Using Levenshtein Distance

Screenshot of Test Project

Download Source Code and Test Project Download Source Code and Test Project

Introduction

While computers are great at testing whether or not two strings match exactly, we sometimes have a need to compare strings that are not identical. For example, a spell checker program must look for words that are similar to those that the user may have misspelled.

I previously wrote about the Soundex and Metaphone algorithms. These algorithms allow you to test for words that sound similar. In this article, I'll discuss the Levenshtein algorithm, which is used to compare the edit distance between two strings.

The edit distance, or Levenshtein distance, is the number of edits required to make one of the strings the same as the other. For example, if we have the strings "Tuesday" and "Thursday", the Levenshtein distance is 2. To make "Tuesday" the same as "Thursday", two edits are required: Insert an 'h' and insert an 'r'.

In fact, the Levenshtein distance algorithm can be used in conjunction with the Soundex and Metephone algorithms. Since the two latter algorithms return a string of characters that represent how a string sounds, the former algorithm could be used to measure how similar two strings sound alike.

The EditDistance Class

Listing 1 shows my EditDistance class. This class has two public methods, which are overloaded versions of the Compute() method. The Compute() method takes two strings and returns an integer that reflects the edit distance between the two strings.

By default, this method is case-sensitive. But it is overloaded with a version that also accepts an ignoreCase argument. If ignoreCase is true, the code considers two characters equal when they are different cases of the same character. The method uses a delegate to refer to the appropriate character comparison method (determined by whether or not the comparison should be case-sensitive). This approach eliminates the need to test the ignoreCase argument every time through the loop.

You should be aware that the method starts by allocating a matrix buffer with a.Length + 1 x b.Length + 1 elements. For larger strings, you may consider this a substantial amount of memory. It is possible to modify the algorithm so that it doesn't require the entire matrix be in memory all at once. But with the strings I used in my testing, the amount of memory was really not a concern. You could always change this matrix to use a smaller data type than int if you are sure the edit distance will never exceed its range.

Listing 1: The EditDistance Class

/// <summary>
/// Class to compute the Levenshtein edit distance between two strings.
/// </summary>
class EditDistance
{
    private delegate bool CharComparer(char a, char b);

    /// <summary>
    /// Computes the Levenshtein distance between two strings.
    /// </summary>
    /// <param name="a">String 1</param>
    /// <param name="b">String 2</param>
    /// <returns></returns>
    public static int Compute(string a, string b)
    {
        return Compute(a, b, false);
    }

    /// <summary>
    /// Computes the Levenshtein distance between two strings.
    /// </summary>
    /// <param name="a">String 1</param>
    /// <param name="b">String 2</param>
    /// <param name="ignoreCase">Indicates if comparisions should
    /// be made without regard to case</param>
    /// <returns></returns>
    public static int Compute(string a, string b, bool ignoreCase)
    {
        // Allocate distance matrix
        int[,] d = new int[a.Length + 1, b.Length + 1];

        // Get character comparer
        CharComparer isEqual = (ignoreCase) ?
            (CharComparer)CharCompareIgnoreCase : CharCompare;

        // Compute distance
        for (int i = 0; i <= a.Length; i++)
            d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++)
            d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                if (isEqual(a[i - 1], b[j - 1]))
                {
                    // No change required
                    d[i, j] = d[i - 1, j - 1];
                }
                else
                {
                    d[i, j] =
                        Math.Min(d[i - 1, j] + 1,    // Deletion
                        Math.Min(d[i, j - 1] + 1,    // Insertion
                        d[i - 1, j - 1] + 1));       // Substitution
                }
            }
        }

        // Return final value
        return d[a.Length, b.Length];
    }

    // Case-sensitive char comparer
    private static bool CharCompare(char a, char b)
    {
        return (a == b);
    }

    // Case-insensitive char comparer
    private static bool CharCompareIgnoreCase(char a, char b)
    {
        return (Char.ToLower(a) == Char.ToLower(b));
    }
}

Using the Code

Using the EditDistance class is simple enough. The two methods described previously are static so you do not need to declare an instance of the class. Simply call the Compute() method with the appropriate arguments.

The attached test project includes all the source code and demonstrates using the class.

Conclusion

I suspect other people will come up with uses for this class that I hadn't considered. Any time you want to measure the difference between two strings, or find which of a list of strings comes closest to a target string, this class would come in handy.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.