Identifying most frequent pattern. ¿Counting criteria?

Question

I'm actually working in a piece of code that will take a generic enumeration and a given length and will return the most frequent pattern found in the enumeration of the specified length and how many times it appears.

My method therefore has the following signature:

public static IEnumerable ExtractFixedLengthPatter(IEnumerable source, int length, out int timesFound) { ... }

The way I've implemented this is as follows (I won't post the corresponding code, its a little long):

Given, lets say, an enumerable of ints: 121234161221 and a pattern length of 2 I build the following tree:

[1] -- [2] {count: 3}
    -- [6] {count: 1}

[2] -- [1] {count: 2}
    -- [2] {count: 1}
    -- [3] {count: 1}

[3] -- [4] {count: 1}

[4] -- [1] {count: 1}

[6] -- [1] {count: 1}

The way I build this tree is that I iterate the stream and in each iteration I create a custom iterator that takes the first length items from the current position and populates the tree:

    1rst outer iteration: {1}
       Fixed length iterator: {1}
                              {2}

    2nd outer iteration: {2}
       Fixed length iterator: {2}
                              {1}

And so on...

Then, I simply identify the final node with maximum count and traverse back up the tree obtaining the most frequent pattern reversed. I reverse the pattern and I'm done.

This algorithm works really well and is pretty fast. The problem is that a co-worker is claiming it has a serious bug. Consider the following case:

Its obvious the most frequent pattern of length 2 is 11. The question is, how many times does it appear in the enumeration? My coworker claims the correct answer is 3:

My algorithm returns 5:

Which one is the correct answer? I'm inclined to believe its 5 but if its 3, does anyone see an easy way I can adapt or change the algorithm to discern this kind of situation?

Sergey Kalinichenko · Accepted Answer

If the question is "how many times 11 appears inside 111111", there is only one answer: five times - at indexes 0, 1, 2, 3, and 4.

Your co-worker is answering a different question - how many non-overlapping copies of the 11 pattern are contained in 111111, which is, indeed, three: at indexes 0, 2, and 4.

we need non overlapping copies, I'm not sure how to adapt the existing algorithm and I don't want to loose the time it took to write it; its working fast and efficiently. Maybe storing index information of each node, and doing a final pass checking indices and pattern length to remove overlapping copies?

You can reuse your algorithm with minimal modifications. Keep the index of the last element at which a match has been found at each leaf node, along with the total count that you currently keep. When your algorithm arrives at the leaf and is ready to increment the total count, it should check if the prior index is at least n items back from the current one. If it is more than n items away, increment the count; otherwise, ignore this sequence.

Identifying most frequent pattern. ¿Counting criteria?

Answers (2)

Related Questions

Identifying most frequent pattern. &#191;Counting criteria?

Answers (2)

Related Questions

Identifying most frequent pattern. ¿Counting criteria?