user2036087
user2036087

Reputation: 311

Finding exactly N consecutives in a sorted list

I recently stumbled upon a little problem.

Part of an algorithm I was working on needed to find n consecutive numbers in a sorted list of numbers.

So, for example, the list would look something like this:

1 2 2 3 4 5 5 5 6 7 8 9 9 9 9

Given that list and N, the amount of consecutive duplicates, the algorithm needed to find the first number in the smallest group of exactly N consecutive numbers. So for example with N = 2 and the given list, the algorithm should find "2". With N = 3 it should pass the group of 2's, find the group of 5's instead since it's the smallest group of 3 consecutive duplicates in that list. It shouldn't return 9 since there are actually 4 consecutive 9's and with N = 3 we are looking for the smallest group of exactly 3 consecutives.

I did in the end cobble together some peace of crap code that did the job, but I was wondering how would some experienced programmer do this. Utilizing as much of the C++11 Style of Code advertised by Stroustroup himself and using as much of the C++11 STL for correctness, portability and compactness for reasoning.

Upvotes: 3

Views: 322

Answers (6)

lucas92
lucas92

Reputation: 464

#include <algorithm>
#include <array>
#include <iostream>

using namespace std;

template<class T>
class Sequence
{
public:
    Sequence(const uint32_t num_items);
    ~Sequence(){}

    bool operator()(const T data);
private:
    T m_value;
    uint32_t m_counter;
    uint32_t m_max;
};

template<class T>
Sequence<T>::Sequence(const uint32_t num_items)
  : m_value(0),
    m_counter(0),
    m_max(num_items)
{
}

template<class T>
bool Sequence<T>::operator()(const T data)
{
    if(m_value == data) {
        m_counter++;
    } else if(m_counter == m_max{
        m_value = data;
        m_counter = 0;
        return true;
    } else{
        m_value = data;
        m_counter = 0;
    }
    return false;
}

int main()
{
    int data[] = {1,2,2,3,4,5,5,5,6,7,8,9,9,9,9};
    array<int,15> ar;
    for(uint32_t i = 0; i < 15; i++)
        ar[i] = data[i];

    //find three consecutive numbers
    Sequence<int> seq(3);

    //getting the first occurence of the sequence
    array<int,15>::iterator it = find_if(ar.begin(),ar.end(),seq);

    //printing the iterator position from begin
    cout << distance(ar.begin(),it) << endl;

    return 0;
}

Upvotes: 1

laune
laune

Reputation: 31290

If speed doesn't matter:

template <class T >
T firstOfN( std::vector<T> list, unsigned N ){
  std::multiset<T> mset( list.begin(), list.end() );
  for( typename std::multiset<T>::iterator it = mset.begin(); it != mset.end(); ++it ){
    if( mset.count( *it ) == N ) return *it;
  }
  throw std::exception();
}

Upvotes: 2

Ivaylo Strandjev
Ivaylo Strandjev

Reputation: 70929

Here is my solution. It does not use any stl standard algorithm but it has the best possible complexity - O(n) and I believe it is quite readable and understandable:

  unsigned cur_value_index = 0;
  unsigned range_size = 1;
  for (unsigned i = 1; i < a.size(); ++i) {
    if (a[i] == a[cur_value_index]) {
      range_size++;
    } else {
      if (range_size == N) {
        cout << cur_value_index << endl;
        break;
      }
      cur_value_index = i;
      range_size = 1;
    }
  }

      if (range_size == N) {       cout << cur_value_index << endl;       }

I assume the sequence is provided in an array a and N is the limit you talk about in the question.

I have used vector for illustration but the very same algorithm can be applied if we don't have random access for instance for list. In that case we would keep an iterator to an element of the sequence instead of an index, but the rest would remain unchanged.

Upvotes: 1

dyp
dyp

Reputation: 39101

On the algorithmic side of things, there's an interesting optimization; pseudo-code:

size_t N;
RaIterator cur = myvector.begin(), end = myvector.end();
while(cur < end-(N-1))
{
    if(*cur == *(cur+N))
    {
        if(cur+N == end || *cur != *(cur+N+1))
            return {cur, cur+N};
        else
            cur = upper_bound(cur+N+1, end, *cur);
    }else
    {
        cur = lower_bound(cur, cur+N, *(cur+N));
    }
}
return {end, end};

If we have random-access iterators, we can skip ranges pretty quickly, once we have an initial element (preceding elements are smaller, succeeding greater or equal):

  • If *cur == *(cur+N), then the range with value *cur is large enough. If *cur != *(cur+N+1), or cur+N == end, then it is indeed the range we're looking for. Else, it is too big, and we can search for the next range (either linearly or with a binary search in [cur+N+1, end)).

  • Else, *cur != *(cur+N), then the current range is too small. Every range completely inside [cur, cur+N] is also too small, so the next range to check is a range that starts inside [cur, cur+N] and extends beyond cur+N. This range is of the value *(cur+N), so we only need to find its initial element (binary search).

Note: Due to the increased "complexity" of a binary search as opposed to a linear search (the constant factors), and due to the rather unpredictable memory accesses, this will probably be slower for a list of small ranges than the strictly linear approach.

Upvotes: 2

Joop Eggen
Joop Eggen

Reputation: 109547

When N is larger, the detection of N same numbers might be "optimized" a bit.

for (int i = 0; i < n - N + 1; ) {
    int ai = a[i]; // New value
    if (ai == a[i + N - 1]) { // Last element same
        if (i + N >= n || ai != a[i + N]) { // Thereafter not
            return i;
        }
        i += N; // Move to last known same element (or past end)
    }
    // Go to next new value:
    ++i;
    while (i < n - N + 1 && a[i] == ai) {
        ++i;
    }
}

It relies on having at the start of the for-loop a new value.

Upvotes: 1

Jerry Coffin
Jerry Coffin

Reputation: 490108

A great deal here depends on frequency of insertion and deletion vs. searching, how large of lists you're looking at, etc.

For the moment, I'm going to make two assumptions:

  1. You're dealing with large enough lists that an asymptotically better algorithm is likely to win over the obvious linear search.
  2. You're doing a lot of queries with the data essentially static.

If that's true, you start by run-length encoding the input data, so you get value/count pairs.

Then you sort those pairs based primarily on the count, and secondarily on the value. Finally, use std::lower_bound to find a value, with the comparison based solely on the count.

This requires O(N log N) for the preprocessing. In exchange, each query requires O(log N) instead of O(N). Therefore, you need to do O(N) queries on the preprocessed data to justify the preprocessing.

Upvotes: 1

Related Questions