# Phonetic String Comparison with Soundex

By Jonathan Wood on 1/14/2011
 Language: C# Technology: .NET Platform: Windows License: CPOL Views: 32,721
 General Programming » Algorithms » Text Algorithms » Phonetic String Comparison with Soundex  Download Source Code

## Introduction

Computers are very good at being precise. Writing code to compare strings is easy and the code will find the smallest differences between those strings.

But what if you want the comparisons to be a little less precise? For example, what if you want to find a string but you're not completely sure about the spelling? In this case, it can be useful to be able to compare strings based on how similar they sound, even though the spelling may be different.

## The Soundex Algorithm

Soundex was developed by Robert C. Russell and Margaret K. Odell in 1918. It allowed the comparison of names that may sound similar but were spelled different. The rules for Soundex were simple.

1. The first letter of the word is the letter of the Soundex code, and is not coded to a number.
2. Replace consonants with digits as follows (after the first letter):
• b, f, p, v => 1
• c, g, j, k, q, s, x, z => 2
• d, t => 3
• l => 4
• m, n => 5
• r => 6
• h, w are not coded
3. Two adjacent letters with the same number are coded as a single number. Letters with the same number separated by an h or w are also coded as a single number.
4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".

Listing 1 shows my Soundex class implemented in C#. It contains one public method, Encode(). This is a static method that encodes a string according to the Soundex rules.

Listing 1: Soundex Class

```/// <summary>
/// Classic soundex algorithm
/// </summary>
class Soundex
{
//  ABCDEFGHIJKLMNOPQRSTUVWXYZ
private const string _values = "01230120022455012623010202";
private const int EncodingLength = 4;

/// <summary>
/// Soundex-encodes the given text
/// </summary>
/// <param name="text">Text to be encoded</param>
/// <returns></returns>
public static string Encode(string text)
{
char prevChar = ' ';

// Normalize input
text = Normalize(text);
if (text.Length == 0)
return text;

// Write result to StringBuilder
StringBuilder builder = new StringBuilder();
builder.Append(text);
for (int i = 1;
i < text.Length && builder.Length < EncodingLength;
i++)
{
// Get digit for this letter
char c = _values[text[i] - 'A'];

// Add if not zero or same as last character
if (c != '0' && c != prevChar)
{
builder.Append(c);
prevChar = c;
}
}

while (builder.Length < EncodingLength)
builder.Append('0');

// Return result
return builder.ToString();
}

/// <summary>
/// Normalizes the given string by removing characters
/// that are not letters and converting the result to
/// upper case.
/// </summary>
/// <param name="text">Text to be normalized</param>
/// <returns></returns>
protected static string Normalize(string text)
{
StringBuilder builder = new StringBuilder();
foreach (char c in text)
{
if (Char.IsLetter(c))
builder.Append(Char.ToUpper(c));
}
return builder.ToString();
}
}
```

## The Metaphone Algorithm

The English language is rather complex and inconsistent. While Soundex is useful, it certainly has it's deficiencies. In 1990, Lawrence Philips developed the Metaphone algorithm to address some of these deficiencies.

Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th", 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.

1. Drop duplicate adjacent letters, except for C.
2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
3. Drop 'B' if after 'M' and if it is at the end of the word.
4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and is at the end.
7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'. Reduce 'GG' to 'G'.
8. Drop 'H' if after vowel and not before a vowel.
9. 'CK' transforms to 'K'.
10. 'PH' transforms to 'F'.
11. 'Q' transforms to 'K'.
12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
14. 'V' transforms to 'F'.
15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
17. Drop 'Y' if not followed by a vowel.
18. 'Z' transforms to 'S'.
19. Drop all vowels unless it is the beginning.

Metaphone uses a more complex set of rules to provide more accurate phonetic comparisons. Listing 2 shows my Metaphone class. Since it is more complex than the Soundex class and needs to track its state in variables, the Encode() method is not static and the class must be instantiated in order to use.

Listing 2: Metaphone Class

```/// <summary>
/// Implements the Metaphone algorithm
/// </summary>
class Metaphone
{
// Constants
protected const int MaxEncodedLength = 6;
protected const char NullChar = (char)0;
protected const string Vowels = "AEIOU";

// For tracking position within current string
protected string _text;
protected int _pos;

/// <summary>
/// Encodes the given text using the Metaphone algorithm.
/// </summary>
/// <param name="text">Text to encode</param>
/// <returns></returns>
public string Encode(string text)
{
// Process normalized text
InitializeText(Normalize(text));

// Write encoded string to StringBuilder
StringBuilder builder = new StringBuilder();

// Special handling of some string prefixes:
//     PN, KN, GN, AE, WR, WH and X
switch (Peek())
{
case 'P':
case 'K':
case 'G':
if (Peek(1) == 'N')
break;

case 'A':
if (Peek(1) == 'E')
break;

case 'W':
if (Peek(1) == 'R')
else if (Peek(1) == 'H')
{
builder.Append('W');
}
break;

case 'X':
builder.Append('S');
break;
}

//
while (!EndOfText && builder.Length < MaxEncodedLength)
{
// Cache this character
char c = Peek();

// Ignore duplicates except CC
if (c == Peek(-1) && c != 'C')
{
continue;
}

// Don't change F, J, L, M, N, R or first-letter vowel
if (IsOneOf(c, "FJLMNR") ||
(builder.Length == 0 && IsOneOf(c, Vowels)))
{
builder.Append(c);
}
else
{
int charsConsumed = 1;

switch (c)
{
case 'B':
// B = 'B' if not -MB
if (Peek(-1) != 'M' || Peek(1) != NullChar)
builder.Append('B');
break;

case 'C':
// C = 'X' if -CIA- or -CH-
// Else 'S' if -CE-, -CI- or -CY-
// Else 'K' if not -SCE-, -SCI- or -SCY-
if (Peek(-1) != 'S' || !IsOneOf(Peek(1), "EIY"))
{
if (Peek(1) == 'I' && Peek(2) == 'A')
builder.Append('X');
else if (IsOneOf(Peek(1), "EIY"))
builder.Append('S');
else if (Peek(1) == 'H')
{
if ((_pos == 0 && !IsOneOf(Peek(2), Vowels)) ||
Peek(-1) == 'S')
builder.Append('K');
else
builder.Append('X');
charsConsumed++;    // Eat 'CH'
}
else builder.Append('K');
}
break;

case 'D':
// D = 'J' if DGE, DGI or DGY
// Else 'T'
if (Peek(1) == 'G' && IsOneOf(Peek(2), "EIY"))
builder.Append('J');
else
builder.Append('T');
break;

case 'G':
// G = 'F' if -GH and not B--GH, D--GH, -H--GH, -H---GH
// Else dropped if -GNED, -GN, -DGE-, -DGI-, -DGY-
// Else 'J' if -GE-, -GI-, -GY- and not GG
// Else K
if ((Peek(1) != 'H' || IsOneOf(Peek(2), Vowels)) &&
(Peek(1) != 'N' || (Peek(1) != NullChar &&
(Peek(2) != 'E' || Peek(3) != 'D'))) &&
(Peek(-1) != 'D' || !IsOneOf(Peek(1), "EIY")))
{
if (IsOneOf(Peek(1), "EIY") && Peek(2) != 'G')
builder.Append('J');
else
builder.Append('K');
}
// Eat GH
if (Peek(1) == 'H')
charsConsumed++;
break;

case 'H':
// H = 'H' if before or not after vowel
if (!IsOneOf(Peek(-1), Vowels) || IsOneOf(Peek(1), Vowels))
builder.Append('H');
break;

case 'K':
// K = 'C' if not CK
if (Peek(-1) != 'C')
builder.Append('K');
break;

case 'P':
// P = 'F' if PH
// Else 'P'
if (Peek(1) == 'H')
{
builder.Append('F');
charsConsumed++;    // Eat 'PH'
}
else
builder.Append('P');
break;

case 'Q':
// Q = 'K'
builder.Append('K');
break;

case 'S':
// S = 'X' if SH, SIO or SIA
// Else 'S'
if (Peek(1) == 'H')
{
builder.Append('X');
charsConsumed++;    // Eat 'SH'
}
else if (Peek(1) == 'I' && IsOneOf(Peek(2), "AO"))
builder.Append('X');
else
builder.Append('S');
break;

case 'T':
// T = 'X' if TIO or TIA
// Else '0' if TH
// Else 'T' if not TCH
if (Peek(1) == 'I' && IsOneOf(Peek(2), "AO"))
builder.Append('X');
else if (Peek(1) == 'H')
{
builder.Append('0');
charsConsumed++;    // Eat 'TH'
}
else if (Peek(1) != 'C' || Peek(2) != 'H')
builder.Append('T');
break;

case 'V':
// V = 'F'
builder.Append('F');
break;

case 'W':
case 'Y':
// W,Y = Keep if not followed by vowel
if (IsOneOf(Peek(1), Vowels))
builder.Append(c);
break;

case 'X':
// X = 'S' if first character (already done)
// Else 'KS'
builder.Append("KS");
break;

case 'Z':
// Z = 'S'
builder.Append('S');
break;
}
}
}
// Return result
return builder.ToString();
}

/// <summary>
///
/// </summary>
/// <param name="text"></param>
protected void InitializeText(string text)
{
_text = text;
_pos = 0;
}

/// <summary>
/// Indicates if the current position is at the end of
/// the text.
/// </summary>
protected bool EndOfText
{
get { return _pos >= _text.Length; }
}

/// <summary>
/// Moves the current position ahead one character.
/// </summary>
{
}

/// <summary>
/// Moves the current position ahead the specified number.
/// of characters.
/// </summary>
/// <param name="count">Number of characters to move
{
_pos = Math.Min(_pos + count, _text.Length);
}

/// <summary>
/// Returns the character at the current position.
/// </summary>
/// <returns></returns>
protected char Peek()
{
return Peek(0);
}

/// <summary>
/// Returns the character at the specified position.
/// </summary>
/// to the current position.</param>
/// <returns></returns>
{
int pos = (_pos + ahead);
if (pos < 0 || pos >= _text.Length)
return NullChar;
return _text[pos];
}

/// <summary>
/// Indicates if the specified character occurs within
/// the specified string.
/// </summary>
/// <param name="c">Character to find</param>
/// <param name="chars">String to search</param>
/// <returns></returns>
protected bool IsOneOf(char c, string chars)
{
return (chars.IndexOf(c) != -1);
}

/// <summary>
/// Normalizes the given string by removing characters
/// that are not letters and converting the result to
/// upper case.
/// </summary>
/// <param name="text">Text to be normalized</param>
/// <returns></returns>
protected string Normalize(string text)
{
StringBuilder builder = new StringBuilder();
foreach (char c in text)
{
if (Char.IsLetter(c))
builder.Append(Char.ToUpper(c));
}
return builder.ToString();
}
}
```

As you can see, Metaphone is quite a bit more complex than Soundex. Since it's more complex, it doesn't run as fast. But it's quite a bit more accurate than Soundex.

## Conclusion

As previously mentioned, the English language does not follow a simple set of rules. It can be even more complex when you're working with names since names may originate from countries that speak different languages. And, of course, to use these algorithms with other languages would require a complete overhaul.

As a result, Metaphone has its deficiencies as well. Endless variations of this algorithm have been developed including yet another algorithm from Philips called Double Metaphone. If the code I've presented doesn't quite meet your needs, you can try and tweak it or research some of the other algorithms that have been developed.

The attached download includes the source code and a sample program that will search a list of words using either algorithm. Jonathan Wood