Hnus
Hnus

Reputation: 959

String similarity with varying lengths but same words

I know there are many string similarity algorithms but I don't know which would be best for my problem.

My strings vary in length but usually have some extra fluff added to one or the other I want algorithm to give high similarity "points" when strings contain same words without typos. For example Stuff and Things corp. is same as Stuff and Things corporation or 101, Stuff and Things corporat or Stuff and Things.

But strings color and colour, Loremipsum and Olremipsum in my case are totally different. My strings would never have characters which are mistyped or swapped also strings are 1 to 50 chars long.

EDIT: Order of same words is very importat New York city would be different or have low level of similarity with York New city

Thanks for any help

Upvotes: 4

Views: 452

Answers (1)

Tim Schmelter
Tim Schmelter

Reputation: 460038

Ok, even if the rules still aren't that clear i'll give it a try.

To summary your requirement:

  • Find the longest common word-sequence in another sentence
  • At least two words must be common, so New York and New Delhi are not equal
  • the order matters, so New York city and York New city are not equal

The method FindCommonWords returns a sequence of words that are common in both sentences or an empty sequence(Enumerable.Empty<string>) if no common word sequence was found.

It first splits both strings by a pre-defined list of word separators into two string[]. Then it checks all "sub-sequences" whether or not it is contained in the other array in the same order(with an extension method IndexOfSequence).

private static readonly char[] wordSeparators = { '\n', '\t', ',', '.', '!', '?', ';', ':', ' ', '-', '/', '\\', '[', ']', '(', ')', '<', '>', '@', '"', '\'' };

public static IEnumerable<string> FindCommonWords(string str1, string str2, StringComparer comparer = null)
{
    if (str1 == null)
        throw new ArgumentNullException("str1", "Both input strings must not be null!");
    if (str2 == null)
        throw new ArgumentNullException("str2", "Both input strings must not be null!");

    if (comparer == null) comparer = StringComparer.CurrentCulture;
    str1 = str1.Trim();
    str2 = str2.Trim();

    string[] words1 = str1.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
    string[] words2 = str2.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
    if(Math.Min(words1.Length, words2.Length) < 2)
        return Enumerable.Empty<string>(); // one word is not supposed to be a commnon word sequence

    // use for-loop to find the longest common words
    for (int wordCount = words1.Length - 1; wordCount >= 2; wordCount--)
    {
        // scan word-count from left to right
        for (int skipCount = 0; wordCount + skipCount <= words1.Length; skipCount++)
        {
            // take wordCount-words from left side and walk from left to right
            IEnumerable<string> wordSeq = words1.Skip(skipCount).Take(wordCount);
            // search sequence in other words
            int indexInWords2 = words2.IndexOfSequence(wordSeq, comparer);
            if (indexInWords2 >= 0)
            {
                // found match in other words, must be longest common sequence
                return wordSeq;
            }
        }
    }
    return Enumerable.Empty<string>();
}

Here's the extension which might also be useful for other requirements:

public static int IndexOfSequence<TSource>(this IEnumerable<TSource> input, IEnumerable<TSource> sequence, IEqualityComparer<TSource> comparer)
{
    if (input == null) throw new ArgumentNullException("input");
    if (sequence == null) throw new ArgumentNullException("sequence");
    if (!sequence.Any()) throw new ArgumentException("Sequence must not be empty", "sequence");
    if (comparer == null)
    {
        comparer = EqualityComparer<TSource>.Default;
    }
    int index = -1, firstIndex = -1, lastFoundIndex = -1;
    bool found = false;

    using (IEnumerator<TSource> enumerator = input.GetEnumerator())
    {
        using (IEnumerator<TSource> enumerator2 = sequence.GetEnumerator())
        {
            enumerator2.MoveNext();
            while (enumerator.MoveNext())
            {
                index++;
                found = comparer.Equals(enumerator.Current, enumerator2.Current);
                if (found && firstIndex == -1)
                    firstIndex = index;
                else if (found && index != lastFoundIndex + 1)
                    found = false; // sequence must be consecutive
                if (found && !enumerator2.MoveNext())
                    return firstIndex;
                if(found)
                    lastFoundIndex = index;
            }
        }
    }
    return -1;
}

Here are your three samples:

var commonWords = FindCommonWords(
     "Stuff and Things corporation", 
     "101, Stuff and Things corporat", 
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords));   // Stuff and Things

commonWords = FindCommonWords(
     "101, Stuff and Things corporat",
     "or Stuff and Things.",
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine( string.Join(" ", commonWords) ); // Stuff and Things

commonWords = FindCommonWords(
     "New York city",
     "York New city",
     StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords));  // empty sequence, no match

Note that it's written from the scratch and not tested thoroughly.

Upvotes: 4

Related Questions