Reputation: 959
I know there are many string similarity algorithms but I don't know which would be best for my problem.
My strings vary in length but usually have some extra fluff added to one or the other I want algorithm to give high similarity "points" when strings contain same words without typos. For example Stuff and Things corp.
is same as Stuff and Things corporation
or 101, Stuff and Things corporat
or Stuff and Things
.
But strings color
and colour
, Loremipsum
and Olremipsum
in my case are totally different. My strings would never have characters which are mistyped or swapped also strings are 1 to 50 chars long.
EDIT:
Order of same words is very importat New York city
would be different or have low level of similarity with York New city
Thanks for any help
Upvotes: 4
Views: 452
Reputation: 460038
Ok, even if the rules still aren't that clear i'll give it a try.
To summary your requirement:
New York
and New Delhi
are not equalNew York city
and York New city
are not equalThe method FindCommonWords
returns a sequence of words that are common in both sentences or an empty sequence(Enumerable.Empty<string>
) if no common word sequence was found.
It first splits both strings by a pre-defined list of word separators into two string[]
. Then it checks all "sub-sequences" whether or not it is contained in the other array in the same order(with an extension method IndexOfSequence
).
private static readonly char[] wordSeparators = { '\n', '\t', ',', '.', '!', '?', ';', ':', ' ', '-', '/', '\\', '[', ']', '(', ')', '<', '>', '@', '"', '\'' };
public static IEnumerable<string> FindCommonWords(string str1, string str2, StringComparer comparer = null)
{
if (str1 == null)
throw new ArgumentNullException("str1", "Both input strings must not be null!");
if (str2 == null)
throw new ArgumentNullException("str2", "Both input strings must not be null!");
if (comparer == null) comparer = StringComparer.CurrentCulture;
str1 = str1.Trim();
str2 = str2.Trim();
string[] words1 = str1.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
string[] words2 = str2.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);
if(Math.Min(words1.Length, words2.Length) < 2)
return Enumerable.Empty<string>(); // one word is not supposed to be a commnon word sequence
// use for-loop to find the longest common words
for (int wordCount = words1.Length - 1; wordCount >= 2; wordCount--)
{
// scan word-count from left to right
for (int skipCount = 0; wordCount + skipCount <= words1.Length; skipCount++)
{
// take wordCount-words from left side and walk from left to right
IEnumerable<string> wordSeq = words1.Skip(skipCount).Take(wordCount);
// search sequence in other words
int indexInWords2 = words2.IndexOfSequence(wordSeq, comparer);
if (indexInWords2 >= 0)
{
// found match in other words, must be longest common sequence
return wordSeq;
}
}
}
return Enumerable.Empty<string>();
}
Here's the extension which might also be useful for other requirements:
public static int IndexOfSequence<TSource>(this IEnumerable<TSource> input, IEnumerable<TSource> sequence, IEqualityComparer<TSource> comparer)
{
if (input == null) throw new ArgumentNullException("input");
if (sequence == null) throw new ArgumentNullException("sequence");
if (!sequence.Any()) throw new ArgumentException("Sequence must not be empty", "sequence");
if (comparer == null)
{
comparer = EqualityComparer<TSource>.Default;
}
int index = -1, firstIndex = -1, lastFoundIndex = -1;
bool found = false;
using (IEnumerator<TSource> enumerator = input.GetEnumerator())
{
using (IEnumerator<TSource> enumerator2 = sequence.GetEnumerator())
{
enumerator2.MoveNext();
while (enumerator.MoveNext())
{
index++;
found = comparer.Equals(enumerator.Current, enumerator2.Current);
if (found && firstIndex == -1)
firstIndex = index;
else if (found && index != lastFoundIndex + 1)
found = false; // sequence must be consecutive
if (found && !enumerator2.MoveNext())
return firstIndex;
if(found)
lastFoundIndex = index;
}
}
}
return -1;
}
Here are your three samples:
var commonWords = FindCommonWords(
"Stuff and Things corporation",
"101, Stuff and Things corporat",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords)); // Stuff and Things
commonWords = FindCommonWords(
"101, Stuff and Things corporat",
"or Stuff and Things.",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine( string.Join(" ", commonWords) ); // Stuff and Things
commonWords = FindCommonWords(
"New York city",
"York New city",
StringComparer.CurrentCultureIgnoreCase);
Console.WriteLine(string.Join(" ", commonWords)); // empty sequence, no match
Note that it's written from the scratch and not tested thoroughly.
Upvotes: 4