John Stack
John Stack

Reputation: 628

Approximate String Comparison

Given strings with some junk in them: "The cat is black"; "The cat is blue"; "The cat is white"; The yellow dog is not a cat";

We need to get strings that are close approximations. In the above example, all the strings referring to the cat are approximate enough for our business case and the last should be discarded.

We thought that adding up all of the decimal values of the characters in the string and if they were within a given range; say +/- 350, then we would say the string is approximate.

Are there other ways to do this? We're using C# in Visual Studio.

I have seen the Levenshtein stuff but was wondering if there were any other perhaps less scientific methods to do so.

https://code.google.com/p/pylevenshtein/

Upvotes: 1

Views: 181

Answers (5)

John Stack
John Stack

Reputation: 628

Ultimately, we ended up doing a massive comparison between Soundex sample code with a dictionary we built and tinstaafl's sample found in pastebin.

We then had to build an additional ruleset that cycled through all exceptions to improve our odds. Neither calculation did it all correctly.

We had considered running everything through a voice converter and back - simply because the client wished to see what might happen; however, we talked them out of it.

Upvotes: 1

tinstaafl
tinstaafl

Reputation: 6948

Here's Levenshtein algorithm written C#, that might do the trick

Upvotes: 2

Wonderbird
Wonderbird

Reputation: 374

Why reinvent the wheel. Is there any reason the "soundex" algorithm would not meet your needs. You can probably find several existing implementations that may meet your needs.

Upvotes: 1

evanmcdonnal
evanmcdonnal

Reputation: 48076

Well here are a couple of not very scientific but relatively simple ways of saying how similar two strings are, the first is just to see how many common words they have.

    string[] words1 = inputString.Split(" ");
    string[] words2 = otherInputString.Split(" ");


    int diff = words1.Intersect(words2).Count();

From here you could do something like diff / words1.Count() to get a percent difference.

Another option since all of your 'like' strings have common substrings until the final word (the color of the cat) would be to use longestCommonSubstring.Length / inputString.Length so say they are X% similar. You can get the longest common substring with something like;

public static string LongestCommonSubstring(List<string> strings)
{
    var firstString = strings.FirstOrDefault();

    var allSubstrings = new List<string>();
    for(int substringLength = firstString.Length -1; substringLength >0; substringLength--)
    {
        for(int offset = 0; (substringLength + offset) < firstString.Length; offset++)
        {
            string currentSubstring = firstString.Substring(offset,substringLength);
            if (!System.String.IsNullOrWhiteSpace(currentSubstring) && !allSubstrings.Contains(currentSubstring))
            {
                allSubstrings.Add(currentSubstring);
            }
        }
    }

    return allSubstrings.OrderBy(subStr => subStr.Length).FirstOrDefault();
}

You would call this like so;

  string subString = LongestCommonSubString(new List<string> { inputstring, otherInputString } );

Yet another option would be to use a diff library. I've used Googles diff-match-patch library in the past and was happy with it. I'm not going to post sample code for that because I don't have anything using it atm but the provides examples should suffice if you go that route.

Upvotes: 3

Mike Johnson
Mike Johnson

Reputation: 721

By combining the use of methods from the System.String class with clever use of regular expressions, I think you could achieve this without the need for a specialized library for the functionality you are looking for.

Regular expressions run fast enough but methods from the String class can get heavy quick if you aren't careful (too many string splits, etc).

I'm no regex wizard but you could build up regular expressions on demand (at runtime) based on either accepted input, or expected input, and use them to detect close approximations.

Upvotes: 1

Related Questions