Reputation: 32770
I'm actually working in a piece of code that will take a generic enumeration and a given length and will return the most frequent pattern found in the enumeration of the specified length and how many times it appears.
My method therefore has the following signature:
public static IEnumerable<T> ExtractFixedLengthPatter<T>(IEnumerable<T> source, int length, out int timesFound) { ... }
The way I've implemented this is as follows (I won't post the corresponding code, its a little long):
Given, lets say, an enumerable of ints: 121234161221
and a pattern length of 2
I build the following tree:
[1] -- [2] {count: 3}
-- [6] {count: 1}
[2] -- [1] {count: 2}
-- [2] {count: 1}
-- [3] {count: 1}
[3] -- [4] {count: 1}
[4] -- [1] {count: 1}
[6] -- [1] {count: 1}
The way I build this tree is that I iterate the stream and in each iteration I create a custom iterator that takes the first length
items from the current position and populates the tree:
1rst outer iteration: {1}
Fixed length iterator: {1}
{2}
2nd outer iteration: {2}
Fixed length iterator: {2}
{1}
And so on...
Then, I simply identify the final node with maximum count and traverse back up the tree obtaining the most frequent pattern reversed. I reverse the pattern and I'm done.
This algorithm works really well and is pretty fast. The problem is that a co-worker is claiming it has a serious bug. Consider the following case:
111111
Its obvious the most frequent pattern of length 2
is 11
. The question is, how many times does it appear in the enumeration? My coworker claims the correct answer is 3
:
111111
11
11
11
My algorithm returns 5
:
111111
11
11
11
11
11
Which one is the correct answer? I'm inclined to believe its 5
but if its 3
, does anyone see an easy way I can adapt or change the algorithm to discern this kind of situation?
Upvotes: 0
Views: 64
Reputation: 726809
If the question is "how many times 11
appears inside 111111
", there is only one answer: five times - at indexes 0, 1, 2, 3, and 4.
Your co-worker is answering a different question - how many non-overlapping copies of the 11
pattern are contained in 111111
, which is, indeed, three: at indexes 0, 2, and 4.
we need non overlapping copies, I'm not sure how to adapt the existing algorithm and I don't want to loose the time it took to write it; its working fast and efficiently. Maybe storing index information of each node, and doing a final pass checking indices and pattern length to remove overlapping copies?
You can reuse your algorithm with minimal modifications. Keep the index of the last element at which a match has been found at each leaf node, along with the total count that you currently keep. When your algorithm arrives at the leaf and is ready to increment the total count, it should check if the prior index is at least n
items back from the current one. If it is more than n
items away, increment the count; otherwise, ignore this sequence.
Upvotes: 1
Reputation: 3038
If you would have to adapt your method to use your colleague's counting procedure you could do like this:
count
field) that will be used for the final nodes, name it lastIndex
, to store the last position that resulted in incrementing the count
for that final node;count
of a final node should be incremented, first check that it does not overlap with the previous occurrence of this sequence (which is known to have started at lastIndex
). The condition would probably look like this:if (node.lastIndex == INVALID_INDEX || node.lastIndex + length <= iterator.currentIndex()) {
node.count++;
node.lastIndex = iterator.currentIndex();
}
Upvotes: 1