Reputation: 3758

How to compare 2 string arrays and find all consecutive matches and save indices?

For example, if I have the following 2 arrays:

string[] userSelect = new string[] {"the", "quick", "brown", "dog", "jumps", "over"};
string[] original = new string[] {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"};

I'm trying to compare the userSelect array against the original array and get all consecutive matches based on index. The userSelect array will always be made up of strings from the original array. So the output would be like the following:

int[] match0 = new int[] {0, 1, 2}; // indices for "the quick brown"
int[] match2 = new int[] {4, 5}; // indices for "jumps over"
int[] match1 = new int[] {3}; // index for "dog"

The userSelect array length will never exceed the original array length, however it can be shorter and the words can be in any order. How would I go about doing this?

Upvotes: 7

Answers (5)

Jon

Reputation: 437386

Use LINQ for added fun

After a few tries I came up with a pure LINQ solution that theoretically can be an one-liner. I did try to make it efficient, but of course functional solutions will result in duplicate computations since you cannot keep state.

We start off with a little pre-processing to save on duplicate computations later. Yes, I know what I 'm doing with index is a questionable practice, but if you are careful it works and it gets there fast:

var index = 0;
var lookup = original.ToLookup(s => s, s => index++);

The monstrosity

var occurrences = userSelect
  .Where(lookup.Contains)
  .SelectMany((s, i) => lookup[s]
    .Select(j => new {
      User = userSelect.Skip(i),
      Original = original.Skip(j),
      Skipped = i
    })
    .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped))
                       .TakeWhile(tuple => tuple.Item1 == tuple.Item2)
    )
    .Select(u => new { 
      Word = s, 
      Start = u.Select(v => v.Item3).Min(), 
      Length = u.Count()
    })
  )
  .GroupBy(v => v.Start + v.Length)
  .Select(g => g.OrderBy(u => u.Start).First())
  .GroupBy(v => v.Word)
  .Select(g => g.OrderByDescending(u => u.Length).First())
  .Select(w => Enumerable.Range(w.Start, w.Length).ToArray())
  .ToList();

Printing this with

foreach (var occurrence in occurrences) {
  Console.WriteLine(
    "Maximal match starting with '{0}': [{1}]",
    userSelect[occurrence[0]],
    string.Join(", ", occurrence)
  );
}

gives

Maximal match starting with 'the': [0, 1, 2]
Maximal match starting with 'dog': [3]
Maximal match starting with 'jumps': [4, 5]

It's immediately obvious that you would not want to use this code in production, another (procedural) solution would be preferable by far. However, this solution has the distinction of being purely functional except for lookup. Of course that can also be written functionally:

var lookup = original.Select((s, i) => Tuple.Create)
                     .ToLookup(t => t.Item1, t => t.Item2);

How it works

Warming up, it creates a dictionary-like structure that associates each word in original to the indexes where it appears in the same collection. This will be used later to create as many matching sequences as possible from each word in userSelect (e.g. "the" will result it two matching sequences because it appears twice in original).

Then:

.Where(lookup.Contains)

This is easy, it removes from consideration all words in userSelect that do not appear in original.

 // For each place where the word s appears in original...
.SelectMany((s, i) => lookup[s]
  // Define the two subsequences of userSelect and original to work on.
  // We are trying to find the number of identical elements until first mismatch.
  .Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = j })

  // Use .Zip to find this subsequence
  .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)).TakeWhile(tuple => tuple.Item1 == tuple.Item2))

  // Note the index in original where the subsequence started and its length
  .Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() })
)

At this point we have projected each matching word in userSelect to an anonymous object with Start and Length properties. However, sequences where the matching length is N will also have resulted in smaller matching sequences of length N-1, N-2, ... 1.

The key here is to realize that for all subsequences in such sets Start + Length is going to be the same; moreover, subsequences from different sets will have different sums of Start + Length. So let's take advantage to trim the results down:

// Obvious from the above
.GroupBy(v => v.Start + v.Length)

// We want to keep the longest subsequence. Since Start + Length is constant for
// all, it follows the one with the largest Length has the smallest Start:
.Select(g => g.OrderBy(u => u.Start).First())

This will still leave us with as many matches for each word in userSelect as times that word appears in original. So let's also cut that down to the longest match:

.GroupBy(v => v.Word)
.Select(g => g.OrderByDescending(u => u.Length).First())

We now have an object like { Word = "the", Start = 0, Length = 3 }. Let's convert that to an array of indexes in userSelect:

.Select(w => Enumerable.Range(w.Start, w.Length).ToArray())

And finally put all these arrays in the same collection and mission accomplished!

Upvotes: 0

p.s.w.g

Reputation: 149020

Here's what I came up with

var matches = 
    (from l in userSelect.Select((s, i) => new { s, i })
     join r in original.Select((s, i) => new { s, i }) 
     on l.s equals r.s 
     group l by r.i - l.i into g
     from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
     group m by new { m.j, m.k } into h
     select h.Select(t => t.i).ToArray())
    .ToArray();

This will output

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 4, 5 } jumps over
matches[2] // { 0 } the 
matches[3] // { 3 } dog

Using the input {"the", "quick", "brown", "the", "lazy", "dog"} yields:

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 0 } the 
matches[2] // { 3 } the
matches[3] // { 3, 4, 5 } the lazy dog

Note that the calls to ToArray are optional. If you don't actually need the results in an array you can leave that out and save a little processing time.

To filter out any sequences that are completely contained with other larger sequences, you can run this code (note the orderby in the modified query):

var matches = 
    (from l in userSelect.Select((s, i) => new { s, i })
     join r in original.Select((s, i) => new { s, i }) 
     on l.s equals r.s 
     group l by r.i - l.i into g
     from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
     group m by new { m.j, m.k } into h
     orderby h.Count() descending
     select h.Select(t => t.i).ToArray());

int take = 0;
var filtered = matches.Where(m => !matches.Take(take++)
                                          .Any(n => m.All(i => n.Contains(i))))
    .ToArray();

Upvotes: 2

Tim Schmelter

Reputation: 460138

This is not very elegant but efficient. When it comes to indices Linq makes it often more complicated and less efficient then simple loops.

string[] userSelect = new string[] { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = new string[] { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };
var consecutiveGroups = new Dictionary<int, IList<string>>();
IList<Tuple<int, string>> uniques = new List<Tuple<int, string>>();

int maxIndex = Math.Min(userSelect.Length, original.Length);
if (maxIndex > 0)
{
    int minIndex = 0;
    int lastMatch = int.MinValue;
    for (int i = 0; i < maxIndex; i++)
    {
        var us = userSelect[i];
        var o = original[i];
        if (us == o)
        {
            if (lastMatch == i - 1)
                consecutiveGroups[minIndex].Add(us);
            else
            {
                minIndex = i;
                consecutiveGroups.Add(minIndex, new List<string>() { us });
            }
            lastMatch = i;
        }
        else
            uniques.Add(Tuple.Create(i, us));
    }
}

output the indices of the consecutive groups + the indices of the uniques:

var consecutiveGroupsIndices = consecutiveGroups
    .OrderByDescending(kv => kv.Value.Count)
    .Select(kv => Enumerable.Range(kv.Key, kv.Value.Count).ToArray()
    .ToArray());
foreach(var consIndexGroup in consecutiveGroupsIndices)
    Console.WriteLine(string.Join(",", consIndexGroup));
Console.WriteLine(string.Join(",", uniques.Select(u => u.Item1)));

Upvotes: 1

Jim Mischel

Reputation: 134005

This would be easier if words couldn't be repeated . . .

The general idea is to create a Dictionary<string, List<int>> from the original word list. That will tell you which words are used at what positions. The dictionary for your sample would be:

key="the", value={0, 6}
key="quick", value={1}
key="brown", value={2}
... etc

Now, when you get the user's input, you step through it sequentially, looking up the words in your dictionary to get the list of positions.

So you look up a word and it's in the dictionary. You save the position(s) returned from the dictionary. Look up the next word. There are three conditions you need to handle:

The word isn't in the dictionary. Save your previous consecutive grouping and go to the next word, where you'll potentially start a new group.
The word is in the dictionary, but the none of the positions returned match the expected positions (the expected positions being one more than the saved positions from the last word). Save your previous consecutive group and go to the next word, where you'll potentially start a new group.
The word is in the dictionary and one of the returned positions matches the expected position. Save those positions and go to the next word.

I hope you get the idea.

Upvotes: 2

evanmcdonnal

Reputation: 48096

This doesn't do exactly what you would like but it's an a really clean and simple way to get a new array with all of the common strings (ie take the intersection of the two arrays).

var results = array1.Intersect(array2, StringComparer.OrdinalIgnoreCase);

After executing the resutls array will have every string (ignoring case) that occurs in both array1 and array2.

If you want a bit of theory the intersect method is based on the intersection operation you do on sets in lambda calculus. The collections in C# implement all the common set operations so it's worth having some familiarity of them. Here's a link to the wiki article; http://en.wikipedia.org/wiki/Intersection_(set_theory)