Reputation: 3758
For example, if I have the following 2 arrays:
string[] userSelect = new string[] {"the", "quick", "brown", "dog", "jumps", "over"};
string[] original = new string[] {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"};
I'm trying to compare the userSelect array against the original array and get all consecutive matches based on index. The userSelect array will always be made up of strings from the original array. So the output would be like the following:
int[] match0 = new int[] {0, 1, 2}; // indices for "the quick brown"
int[] match2 = new int[] {4, 5}; // indices for "jumps over"
int[] match1 = new int[] {3}; // index for "dog"
The userSelect array length will never exceed the original array length, however it can be shorter and the words can be in any order. How would I go about doing this?
Upvotes: 7
Views: 4071
Reputation: 437386
After a few tries I came up with a pure LINQ solution that theoretically can be an one-liner. I did try to make it efficient, but of course functional solutions will result in duplicate computations since you cannot keep state.
We start off with a little pre-processing to save on duplicate computations later. Yes, I know what I 'm doing with index is a questionable practice, but if you are careful it works and it gets there fast:
var index = 0;
var lookup = original.ToLookup(s => s, s => index++);
var occurrences = userSelect
.Where(lookup.Contains)
.SelectMany((s, i) => lookup[s]
.Select(j => new {
User = userSelect.Skip(i),
Original = original.Skip(j),
Skipped = i
})
.Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped))
.TakeWhile(tuple => tuple.Item1 == tuple.Item2)
)
.Select(u => new {
Word = s,
Start = u.Select(v => v.Item3).Min(),
Length = u.Count()
})
)
.GroupBy(v => v.Start + v.Length)
.Select(g => g.OrderBy(u => u.Start).First())
.GroupBy(v => v.Word)
.Select(g => g.OrderByDescending(u => u.Length).First())
.Select(w => Enumerable.Range(w.Start, w.Length).ToArray())
.ToList();
Printing this with
foreach (var occurrence in occurrences) {
Console.WriteLine(
"Maximal match starting with '{0}': [{1}]",
userSelect[occurrence[0]],
string.Join(", ", occurrence)
);
}
gives
Maximal match starting with 'the': [0, 1, 2]
Maximal match starting with 'dog': [3]
Maximal match starting with 'jumps': [4, 5]
It's immediately obvious that you would not want to use this code in production, another (procedural) solution would be preferable by far. However, this solution has the distinction of being purely functional except for lookup
. Of course that can also be written functionally:
var lookup = original.Select((s, i) => Tuple.Create)
.ToLookup(t => t.Item1, t => t.Item2);
Warming up, it creates a dictionary-like structure that associates each word in original
to the indexes where it appears in the same collection. This will be used later to create as many matching sequences as possible from each word in userSelect
(e.g. "the" will result it two matching sequences because it appears twice in original
).
Then:
.Where(lookup.Contains)
This is easy, it removes from consideration all words in userSelect
that do not appear in original
.
// For each place where the word s appears in original...
.SelectMany((s, i) => lookup[s]
// Define the two subsequences of userSelect and original to work on.
// We are trying to find the number of identical elements until first mismatch.
.Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = j })
// Use .Zip to find this subsequence
.Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)).TakeWhile(tuple => tuple.Item1 == tuple.Item2))
// Note the index in original where the subsequence started and its length
.Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() })
)
At this point we have projected each matching word in userSelect
to an anonymous object with Start
and Length
properties. However, sequences where the matching length is N will also have resulted in smaller matching sequences of length N-1, N-2, ... 1.
The key here is to realize that for all subsequences in such sets Start + Length
is going to be the same; moreover, subsequences from different sets will have different sums of Start + Length
. So let's take advantage to trim the results down:
// Obvious from the above
.GroupBy(v => v.Start + v.Length)
// We want to keep the longest subsequence. Since Start + Length is constant for
// all, it follows the one with the largest Length has the smallest Start:
.Select(g => g.OrderBy(u => u.Start).First())
This will still leave us with as many matches for each word in userSelect
as times that word appears in original
. So let's also cut that down to the longest match:
.GroupBy(v => v.Word)
.Select(g => g.OrderByDescending(u => u.Length).First())
We now have an object like { Word = "the", Start = 0, Length = 3 }
. Let's convert that to an array of indexes in userSelect
:
.Select(w => Enumerable.Range(w.Start, w.Length).ToArray())
And finally put all these arrays in the same collection and mission accomplished!
Upvotes: 0
Reputation: 149020
Here's what I came up with
var matches =
(from l in userSelect.Select((s, i) => new { s, i })
join r in original.Select((s, i) => new { s, i })
on l.s equals r.s
group l by r.i - l.i into g
from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
group m by new { m.j, m.k } into h
select h.Select(t => t.i).ToArray())
.ToArray();
This will output
matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 4, 5 } jumps over
matches[2] // { 0 } the
matches[3] // { 3 } dog
Using the input {"the", "quick", "brown", "the", "lazy", "dog"}
yields:
matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 0 } the
matches[2] // { 3 } the
matches[3] // { 3, 4, 5 } the lazy dog
Note that the calls to ToArray
are optional. If you don't actually need the results in an array you can leave that out and save a little processing time.
To filter out any sequences that are completely contained with other larger sequences, you can run this code (note the orderby
in the modified query):
var matches =
(from l in userSelect.Select((s, i) => new { s, i })
join r in original.Select((s, i) => new { s, i })
on l.s equals r.s
group l by r.i - l.i into g
from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
group m by new { m.j, m.k } into h
orderby h.Count() descending
select h.Select(t => t.i).ToArray());
int take = 0;
var filtered = matches.Where(m => !matches.Take(take++)
.Any(n => m.All(i => n.Contains(i))))
.ToArray();
Upvotes: 2
Reputation: 460138
This is not very elegant but efficient. When it comes to indices Linq makes it often more complicated and less efficient then simple loops.
string[] userSelect = new string[] { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = new string[] { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };
var consecutiveGroups = new Dictionary<int, IList<string>>();
IList<Tuple<int, string>> uniques = new List<Tuple<int, string>>();
int maxIndex = Math.Min(userSelect.Length, original.Length);
if (maxIndex > 0)
{
int minIndex = 0;
int lastMatch = int.MinValue;
for (int i = 0; i < maxIndex; i++)
{
var us = userSelect[i];
var o = original[i];
if (us == o)
{
if (lastMatch == i - 1)
consecutiveGroups[minIndex].Add(us);
else
{
minIndex = i;
consecutiveGroups.Add(minIndex, new List<string>() { us });
}
lastMatch = i;
}
else
uniques.Add(Tuple.Create(i, us));
}
}
output the indices of the consecutive groups + the indices of the uniques:
var consecutiveGroupsIndices = consecutiveGroups
.OrderByDescending(kv => kv.Value.Count)
.Select(kv => Enumerable.Range(kv.Key, kv.Value.Count).ToArray()
.ToArray());
foreach(var consIndexGroup in consecutiveGroupsIndices)
Console.WriteLine(string.Join(",", consIndexGroup));
Console.WriteLine(string.Join(",", uniques.Select(u => u.Item1)));
Upvotes: 1
Reputation: 134005
This would be easier if words couldn't be repeated . . .
The general idea is to create a Dictionary<string, List<int>>
from the original word list. That will tell you which words are used at what positions. The dictionary for your sample would be:
key="the", value={0, 6}
key="quick", value={1}
key="brown", value={2}
... etc
Now, when you get the user's input, you step through it sequentially, looking up the words in your dictionary to get the list of positions.
So you look up a word and it's in the dictionary. You save the position(s) returned from the dictionary. Look up the next word. There are three conditions you need to handle:
I hope you get the idea.
Upvotes: 2
Reputation: 48096
This doesn't do exactly what you would like but it's an a really clean and simple way to get a new array with all of the common strings (ie take the intersection of the two arrays).
var results = array1.Intersect(array2, StringComparer.OrdinalIgnoreCase);
After executing the resutls
array will have every string (ignoring case) that occurs in both array1
and array2
.
If you want a bit of theory the intersect method is based on the intersection operation you do on sets in lambda calculus. The collections in C# implement all the common set operations so it's worth having some familiarity of them. Here's a link to the wiki article; http://en.wikipedia.org/wiki/Intersection_(set_theory)
Upvotes: 1