Login


Phonetic String Comparison with Soundex

By Jonathan Wood on 1/14/2011
Language: C#
Technology: .NET
Platform: Windows
License: CPOL
Views: 37,893
General Programming » Algorithms » Text Algorithms » Phonetic String Comparison with Soundex

Demo Project Screenshot

Download Source Code Download Source Code

Introduction

Computers are very good at being precise. Writing code to compare strings is easy and the code will find the smallest differences between those strings.

But what if you want the comparisons to be a little less precise? For example, what if you want to find a string but you're not completely sure about the spelling? In this case, it can be useful to be able to compare strings based on how similar they sound, even though the spelling may be different.

The Soundex Algorithm

Soundex was developed by Robert C. Russell and Margaret K. Odell in 1918. It allowed the comparison of names that may sound similar but were spelled different. The rules for Soundex were simple.

  1. The first letter of the word is the letter of the Soundex code, and is not coded to a number.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v => 1
    • c, g, j, k, q, s, x, z => 2
    • d, t => 3
    • l => 4
    • m, n => 5
    • r => 6
    • h, w are not coded
  3. Two adjacent letters with the same number are coded as a single number. Letters with the same number separated by an h or w are also coded as a single number.
  4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".

Listing 1 shows my Soundex class implemented in C#. It contains one public method, Encode(). This is a static method that encodes a string according to the Soundex rules.

Listing 1: Soundex Class

/// <summary>
/// Classic soundex algorithm
/// </summary>
class Soundex
{
                                //  ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private const string _values = "01230120022455012623010202";
    private const int EncodingLength = 4;

    /// <summary>
    /// Soundex-encodes the given text
    /// </summary>
    /// <param name="text">Text to be encoded</param>
    /// <returns></returns>
    public static string Encode(string text)
    {
        char prevChar = ' ';

        // Normalize input
        text = Normalize(text);
        if (text.Length == 0)
            return text;

        // Write result to StringBuilder
        StringBuilder builder = new StringBuilder();
        builder.Append(text[0]);
        for (int i = 1;
            i < text.Length && builder.Length < EncodingLength;
            i++)
        {
            // Get digit for this letter
            char c = _values[text[i] - 'A'];

            // Add if not zero or same as last character
            if (c != '0' && c != prevChar)
            {
                builder.Append(c);
                prevChar = c;
            }
        }

        // Pad with trailing zeros
        while (builder.Length < EncodingLength)
            builder.Append('0');

        // Return result
        return builder.ToString();
    }

    /// <summary>
    /// Normalizes the given string by removing characters
    /// that are not letters and converting the result to
    /// upper case.
    /// </summary>
    /// <param name="text">Text to be normalized</param>
    /// <returns></returns>
    protected static string Normalize(string text)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in text)
        {
            if (Char.IsLetter(c))
                builder.Append(Char.ToUpper(c));
        }
        return builder.ToString();
    }
}

The Metaphone Algorithm

The English language is rather complex and inconsistent. While Soundex is useful, it certainly has it's deficiencies. In 1990, Lawrence Philips developed the Metaphone algorithm to address some of these deficiencies.

Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th", 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.

  1. Drop duplicate adjacent letters, except for C.
  2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
  3. Drop 'B' if after 'M' and if it is at the end of the word.
  4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
  5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
  6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and is at the end.
  7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'. Reduce 'GG' to 'G'.
  8. Drop 'H' if after vowel and not before a vowel.
  9. 'CK' transforms to 'K'.
  10. 'PH' transforms to 'F'.
  11. 'Q' transforms to 'K'.
  12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
  13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
  14. 'V' transforms to 'F'.
  15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
  16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
  17. Drop 'Y' if not followed by a vowel.
  18. 'Z' transforms to 'S'.
  19. Drop all vowels unless it is the beginning.

Metaphone uses a more complex set of rules to provide more accurate phonetic comparisons. Listing 2 shows my Metaphone class. Since it is more complex than the Soundex class and needs to track its state in variables, the Encode() method is not static and the class must be instantiated in order to use.

Listing 2: Metaphone Class

/// <summary>
/// Implements the Metaphone algorithm
/// </summary>
class Metaphone
{
    // Constants
    protected const int MaxEncodedLength = 6;
    protected const char NullChar = (char)0;
    protected const string Vowels = "AEIOU";

    // For tracking position within current string
    protected string _text;
    protected int _pos;

    /// <summary>
    /// Encodes the given text using the Metaphone algorithm.
    /// </summary>
    /// <param name="text">Text to encode</param>
    /// <returns></returns>
    public string Encode(string text)
    {
        // Process normalized text
        InitializeText(Normalize(text));

        // Write encoded string to StringBuilder
        StringBuilder builder = new StringBuilder();

        // Special handling of some string prefixes:
        //     PN, KN, GN, AE, WR, WH and X
        switch (Peek())
        {
            case 'P':
            case 'K':
            case 'G':
                if (Peek(1) == 'N')
                    MoveAhead();
                break;

            case 'A':
                if (Peek(1) == 'E')
                    MoveAhead();
                break;

            case 'W':
                if (Peek(1) == 'R')
                    MoveAhead();
                else if (Peek(1) == 'H')
                {
                    builder.Append('W');
                    MoveAhead(2);
                }
                break;

            case 'X':
                builder.Append('S');
                MoveAhead();
                break;
        }

        //
        while (!EndOfText && builder.Length < MaxEncodedLength)
        {
            // Cache this character
            char c = Peek();

            // Ignore duplicates except CC
            if (c == Peek(-1) && c != 'C')
            {
                MoveAhead();
                continue;
            }

            // Don't change F, J, L, M, N, R or first-letter vowel
            if (IsOneOf(c, "FJLMNR") ||
                (builder.Length == 0 && IsOneOf(c, Vowels)))
            {
                builder.Append(c);
                MoveAhead();
            }
            else
            {
                int charsConsumed = 1;

                switch (c)
                {
                    case 'B':
                        // B = 'B' if not -MB
                        if (Peek(-1) != 'M' || Peek(1) != NullChar)
                            builder.Append('B');
                        break;

                    case 'C':
                        // C = 'X' if -CIA- or -CH-
                        // Else 'S' if -CE-, -CI- or -CY-
                        // Else 'K' if not -SCE-, -SCI- or -SCY-
                        if (Peek(-1) != 'S' || !IsOneOf(Peek(1), "EIY"))
                        {
                            if (Peek(1) == 'I' && Peek(2) == 'A')
                                builder.Append('X');
                            else if (IsOneOf(Peek(1), "EIY"))
                                builder.Append('S');
                            else if (Peek(1) == 'H')
                            {
                                if ((_pos == 0 && !IsOneOf(Peek(2), Vowels)) ||
                                    Peek(-1) == 'S')
                                    builder.Append('K');
                                else
                                    builder.Append('X');
                                charsConsumed++;    // Eat 'CH'
                            }
                            else builder.Append('K');
                        }
                        break;

                    case 'D':
                        // D = 'J' if DGE, DGI or DGY
                        // Else 'T'
                        if (Peek(1) == 'G' && IsOneOf(Peek(2), "EIY"))
                            builder.Append('J');
                        else
                            builder.Append('T');
                        break;

                    case 'G':
                        // G = 'F' if -GH and not B--GH, D--GH, -H--GH, -H---GH
                        // Else dropped if -GNED, -GN, -DGE-, -DGI-, -DGY-
                        // Else 'J' if -GE-, -GI-, -GY- and not GG
                        // Else K
                        if ((Peek(1) != 'H' || IsOneOf(Peek(2), Vowels)) &&
                            (Peek(1) != 'N' || (Peek(1) != NullChar &&
                            (Peek(2) != 'E' || Peek(3) != 'D'))) &&
                            (Peek(-1) != 'D' || !IsOneOf(Peek(1), "EIY")))
                        {
                            if (IsOneOf(Peek(1), "EIY") && Peek(2) != 'G')
                                builder.Append('J');
                            else
                                builder.Append('K');
                        }
                        // Eat GH
                        if (Peek(1) == 'H')
                            charsConsumed++;
                        break;

                    case 'H':
                        // H = 'H' if before or not after vowel
                        if (!IsOneOf(Peek(-1), Vowels) || IsOneOf(Peek(1), Vowels))
                            builder.Append('H');
                        break;

                    case 'K':
                        // K = 'C' if not CK
                        if (Peek(-1) != 'C')
                            builder.Append('K');
                        break;

                    case 'P':
                        // P = 'F' if PH
                        // Else 'P'
                        if (Peek(1) == 'H')
                        {
                            builder.Append('F');
                            charsConsumed++;    // Eat 'PH'
                        }
                        else
                            builder.Append('P');
                        break;

                    case 'Q':
                        // Q = 'K'
                        builder.Append('K');
                        break;

                    case 'S':
                        // S = 'X' if SH, SIO or SIA
                        // Else 'S'
                        if (Peek(1) == 'H')
                        {
                            builder.Append('X');
                            charsConsumed++;    // Eat 'SH'
                        }
                        else if (Peek(1) == 'I' && IsOneOf(Peek(2), "AO"))
                            builder.Append('X');
                        else
                            builder.Append('S');
                        break;

                    case 'T':
                        // T = 'X' if TIO or TIA
                        // Else '0' if TH
                        // Else 'T' if not TCH
                        if (Peek(1) == 'I' && IsOneOf(Peek(2), "AO"))
                            builder.Append('X');
                        else if (Peek(1) == 'H')
                        {
                            builder.Append('0');
                            charsConsumed++;    // Eat 'TH'
                        }
                        else if (Peek(1) != 'C' || Peek(2) != 'H')
                            builder.Append('T');
                        break;

                    case 'V':
                        // V = 'F'
                        builder.Append('F');
                        break;

                    case 'W':
                    case 'Y':
                        // W,Y = Keep if not followed by vowel
                        if (IsOneOf(Peek(1), Vowels))
                            builder.Append(c);
                        break;

                    case 'X':
                        // X = 'S' if first character (already done)
                        // Else 'KS'
                        builder.Append("KS");
                        break;

                    case 'Z':
                        // Z = 'S'
                        builder.Append('S');
                        break;
                }
                // Advance over consumed characters
                MoveAhead(charsConsumed);
            }
        }
        // Return result
        return builder.ToString();
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="text"></param>
    protected void InitializeText(string text)
    {
        _text = text;
        _pos = 0;
    }

    /// <summary>
    /// Indicates if the current position is at the end of
    /// the text.
    /// </summary>
    protected bool EndOfText
    {
        get { return _pos >= _text.Length; }
    }

    /// <summary>
    /// Moves the current position ahead one character.
    /// </summary>
    void MoveAhead()
    {
        MoveAhead(1);
    }

    /// <summary>
    /// Moves the current position ahead the specified number.
    /// of characters.
    /// </summary>
    /// <param name="count">Number of characters to move
    /// ahead.</param>
    void MoveAhead(int count)
    {
        _pos = Math.Min(_pos + count, _text.Length);
    }

    /// <summary>
    /// Returns the character at the current position.
    /// </summary>
    /// <returns></returns>
    protected char Peek()
    {
        return Peek(0);
    }

    /// <summary>
    /// Returns the character at the specified position.
    /// </summary>
    /// <param name="ahead">Position to read relative
    /// to the current position.</param>
    /// <returns></returns>
    protected char Peek(int ahead)
    {
        int pos = (_pos + ahead);
        if (pos < 0 || pos >= _text.Length)
            return NullChar;
        return _text[pos];
    }

    /// <summary>
    /// Indicates if the specified character occurs within
    /// the specified string.
    /// </summary>
    /// <param name="c">Character to find</param>
    /// <param name="chars">String to search</param>
    /// <returns></returns>
    protected bool IsOneOf(char c, string chars)
    {
        return (chars.IndexOf(c) != -1);
    }

    /// <summary>
    /// Normalizes the given string by removing characters
    /// that are not letters and converting the result to
    /// upper case.
    /// </summary>
    /// <param name="text">Text to be normalized</param>
    /// <returns></returns>
    protected string Normalize(string text)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in text)
        {
            if (Char.IsLetter(c))
                builder.Append(Char.ToUpper(c));
        }
        return builder.ToString();
    }
}

As you can see, Metaphone is quite a bit more complex than Soundex. Since it's more complex, it doesn't run as fast. But it's quite a bit more accurate than Soundex.

Conclusion

As previously mentioned, the English language does not follow a simple set of rules. It can be even more complex when you're working with names since names may originate from countries that speak different languages. And, of course, to use these algorithms with other languages would require a complete overhaul.

As a result, Metaphone has its deficiencies as well. Endless variations of this algorithm have been developed including yet another algorithm from Philips called Double Metaphone. If the code I've presented doesn't quite meet your needs, you can try and tweak it or research some of the other algorithms that have been developed.

The attached download includes the source code and a sample program that will search a list of words using either algorithm.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.