InBetween
InBetween

Reputation: 32770

Identifying most frequent pattern. ¿Counting criteria?

I'm actually working in a piece of code that will take a generic enumeration and a given length and will return the most frequent pattern found in the enumeration of the specified length and how many times it appears.

My method therefore has the following signature:

public static IEnumerable<T> ExtractFixedLengthPatter<T>(IEnumerable<T> source, int length, out int timesFound) { ... }

The way I've implemented this is as follows (I won't post the corresponding code, its a little long):

The way I build this tree is that I iterate the stream and in each iteration I create a custom iterator that takes the first length items from the current position and populates the tree:

    1rst outer iteration: {1}
       Fixed length iterator: {1}
                              {2}

    2nd outer iteration: {2}
       Fixed length iterator: {2}
                              {1}

And so on...

Then, I simply identify the final node with maximum count and traverse back up the tree obtaining the most frequent pattern reversed. I reverse the pattern and I'm done.

This algorithm works really well and is pretty fast. The problem is that a co-worker is claiming it has a serious bug. Consider the following case:

 111111

Its obvious the most frequent pattern of length 2 is 11. The question is, how many times does it appear in the enumeration? My coworker claims the correct answer is 3:

 111111
 11
   11
     11

My algorithm returns 5:

 111111
 11
  11
   11
    11
     11

Which one is the correct answer? I'm inclined to believe its 5 but if its 3, does anyone see an easy way I can adapt or change the algorithm to discern this kind of situation?

Upvotes: 0

Views: 64

Answers (2)

Sergey Kalinichenko
Sergey Kalinichenko

Reputation: 726809

If the question is "how many times 11 appears inside 111111", there is only one answer: five times - at indexes 0, 1, 2, 3, and 4.

Your co-worker is answering a different question - how many non-overlapping copies of the 11 pattern are contained in 111111, which is, indeed, three: at indexes 0, 2, and 4.

we need non overlapping copies, I'm not sure how to adapt the existing algorithm and I don't want to loose the time it took to write it; its working fast and efficiently. Maybe storing index information of each node, and doing a final pass checking indices and pattern length to remove overlapping copies?

You can reuse your algorithm with minimal modifications. Keep the index of the last element at which a match has been found at each leaf node, along with the total count that you currently keep. When your algorithm arrives at the leaf and is ready to increment the total count, it should check if the prior index is at least n items back from the current one. If it is more than n items away, increment the count; otherwise, ignore this sequence.

Upvotes: 1

danbanica
danbanica

Reputation: 3038

If you would have to adapt your method to use your colleague's counting procedure you could do like this:

  • add an additional field (besides the count field) that will be used for the final nodes, name it lastIndex, to store the last position that resulted in incrementing the count for that final node;
  • every time the count of a final node should be incremented, first check that it does not overlap with the previous occurrence of this sequence (which is known to have started at lastIndex). The condition would probably look like this:

if (node.lastIndex == INVALID_INDEX || node.lastIndex + length <= iterator.currentIndex()) {
    node.count++;
    node.lastIndex = iterator.currentIndex();
}

Upvotes: 1

Related Questions