Fuzzy String Matching

Fuzzy matching and confidence levels is what this exercise is all about. It is tough to match two strings and say that they are quite similar, but not exact. There are a few ways you can achieve this goal.

Levenshtein Distance: This calculates the minimum number of insertions, deletions, and substitutions necessary to convert one string into another. A low distance between two strings means that the strings are more similar. The best site I have found is Levenshtein Distance, in Three Flavors.

I have modified their algorithm and created C and FoxPro versions. My methods do not use a huge matrix – they just use a one-dimensional array the same length as one of the strings. They also keep the values in the array incremented by 1 so that the comparisons in the loop do not need to perform additional math. The goal was to tweak the loop and try to keep math to a minimum in there.

Gestalt: I stumbled across this algorithm in PHP's documentation about the similar_text function. The best source for the algorithm that I found was in PHP's source code for the string functions. Look for the php_similar_str, php_similar_char, and PHP_FUNCTION(similar_text) functions.

I have created C and FoxPro versions of the code. They are both recursive, so be careful with large strings on limited devices. Also, Eduardo Curtolo <> provided a Pascal version.

SoundEx: This algorithm is used on your driver's licence (in the U.S.). Its goal is to group letters that sound alike, then convert the name into a series of numbers that can represent the name. Understanding Classic SoundEx Algorithms provides a very nice description of how SoundEx is used and generated. Taking the concept one step further, you could read A Better Phonetic Lookup and get an algorithm that matches really well, based on how the language works.

An optimist is a guy that has never had much experience.
– Don Marquis
Tyler Akins! <>
Contact Me - Legal Info