stefan

Reputation: 10355

eliminating duplicates in std::vector

I have a very large std::vector of std::vectors containing a fixed number of unsigned integers.

All vectors of uints are sorted ascending.

My current way to eliminate duplicate vectors is:

unsigned int i = 0;
while ( i < new_combs.size() )
{
  unsigned int j = i + 1;
  while ( j < new_combs.size() )
  {
     // compare vectors i and j element by element
     unsigned int k = 0;
     while ( k < new_combs.at(i).size() && new_combs.at(i).at(k) == new_combs.at(j).at(k) )
        ++k;
     if ( k == new_combs.at(i).size() )   // all elements matched: j is a duplicate
        new_combs.erase(new_combs.begin() + j);
     else
        ++j;
  }
  ++i;
}

Here, new_combs is the vector of vectors described above.

Is there a more efficient way to eliminate duplicates when the outer vector is unsorted?

Upvotes: 1

Views: 1828

Answers (6)

Patrick

Reputation: 23629

There are several elements in your code that ring alarm bells regarding performance.

First, you are using vectors. Erasing elements from the middle of a vector is always slow. You might consider using a different container (e.g. std::list), or adjusting your code so that erased slots are marked with a special "empty" value (e.g. zero or -1) instead of being removed.

Second, you could use an std::set or std::unordered_set to keep values that you already encountered. That way, you only have to loop over your vectors once.
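For duplicate scalar values (the reading this answer originally assumed), a one-pass sketch of that idea might look as follows; `dedup_values` is a hypothetical helper name, not from the question:

```cpp
#include <unordered_set>
#include <vector>

// Remove duplicate values from a single vector in one pass,
// keeping the first occurrence of each value.
std::vector<unsigned int> dedup_values(const std::vector<unsigned int>& in)
{
    std::unordered_set<unsigned int> seen;
    std::vector<unsigned int> out;
    out.reserve(in.size());
    for (unsigned int v : in)
        if (seen.insert(v).second)   // insert() reports whether v was new
            out.push_back(v);
    return out;
}
```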

EDIT: Forget this answer. I misread the question and thought that duplicate values (not duplicate vectors) had to be removed.

Nevertheless, some reactions on the comments given:

  • @Jerry: I agree that vectors are in most cases faster than lists, but only if the size of the vector is limited. If the vector contains 1 million elements and you need to remove the 3rd, then the 5th, then the 10th, ... you end up moving lots of elements. In such a case a list could be faster.
  • @James: in the original question, the elements were not removed from the end of the vector, but from the middle. If the vector is very big (let's say 1 million elements), then removing elements can still become a bottleneck. However, I agree that using a sort followed by a unique is probably faster.

Upvotes: 0

Branko Dimitrijevic

Reputation: 52157

I agree with Luchian Grigore's answer, but you might also consider converting the whole outer vector to an unordered_set, which is an O(n) operation on average provided the hashes of the sub-vectors are not too lopsided (as opposed to O(n*log(n)) for sorting). You can even store pointers to the sub-vectors in your unordered_set, to avoid unnecessary copying. This can be an important performance difference for large amounts of data.
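A sketch of this approach, assuming a hash-combine-style hash over the elements (the constant and shift amounts follow boost::hash_combine and are an illustrative choice; `dedup_vectors` is a hypothetical helper). Note that the result is in set order, not input order:

```cpp
#include <unordered_set>
#include <vector>
#include <functional>

// Hash for a vector of unsigned ints, combining element hashes
// in the style of boost::hash_combine.
struct VecHash {
    std::size_t operator()(const std::vector<unsigned int>& v) const {
        std::size_t h = 0;
        for (unsigned int x : v)
            h ^= std::hash<unsigned int>()(x) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// Average O(n) deduplication: operator== on std::vector compares
// element-wise, so the set keeps exactly one copy of each sub-vector.
std::vector<std::vector<unsigned int>>
dedup_vectors(const std::vector<std::vector<unsigned int>>& combs)
{
    std::unordered_set<std::vector<unsigned int>, VecHash>
        seen(combs.begin(), combs.end());
    return std::vector<std::vector<unsigned int>>(seen.begin(), seen.end());
}
```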

This example illustrates the basic idea of using your own hash function and pointers (it deals with vector of strings and uses unordered_map, not unordered_set, but you should be able to modify it fairly easily to your needs).

Upvotes: 0

Ivaylo Strandjev

Reputation: 71009

Not much you can do if the vector is unsorted. If it is sorted, however, you can use std::unique from <algorithm>:

std::sort is not needed in that case; std::unique moves the duplicates to the end and erase removes them:

new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());

Upvotes: 2

thiton

Reputation: 36059

Asymptotically, your algorithm seems like the usual O(n) implementation and is therefore optimal. (Even though I didn't understand your diagonalization strategy with i and j and why you only erase, but never move elements. Your code is highly unclear.) However, you are duplicating the STL, and a shorter version of a unique-ing loop is:

struct unique {
    template <class C>
    void operator()( C& c ) {
         c.erase( std::unique( c.begin(), c.end() ), c.end() );
    }
};

std::for_each( new_combs.begin(), new_combs.end(), unique() );
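With C++11, the same per-inner-vector unique-ing can be written with a lambda instead of a hand-rolled functor; a sketch (note this removes adjacent duplicate values inside each sub-vector, not duplicate sub-vectors, and `unique_each` is a hypothetical name):

```cpp
#include <algorithm>
#include <vector>

// Erase adjacent duplicate values inside each inner vector, in place.
// Since the inner vectors are sorted, this removes all value duplicates.
void unique_each(std::vector<std::vector<unsigned int>>& combs)
{
    std::for_each(combs.begin(), combs.end(),
                  [](std::vector<unsigned int>& c) {
                      c.erase(std::unique(c.begin(), c.end()), c.end());
                  });
}
```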

Upvotes: 0

xpapad

Reputation: 4456

Have you considered using std::set? It is ordered and doesn't allow duplicates to begin with.
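This works directly on the question's data shape, because std::vector already provides a lexicographic operator<, so sub-vectors can serve as set keys with no extra code; a sketch with a hypothetical `to_unique_set` helper:

```cpp
#include <set>
#include <vector>

// Build a std::set from the outer vector: duplicates are dropped on
// insertion and iteration order is sorted (lexicographic).
std::set<std::vector<unsigned int>>
to_unique_set(const std::vector<std::vector<unsigned int>>& combs)
{
    return std::set<std::vector<unsigned int>>(combs.begin(), combs.end());
}
```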

Upvotes: 3

Luchian Grigore

Reputation: 258678

A shorter way would be using <algorithm>:

std::sort(new_combs.begin(), new_combs.end());
new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());

Unless you specifically need a std::vector, you can use std::set to avoid duplicates.
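Wrapped up as a self-contained sketch (the helper name `remove_duplicates` is illustrative), the two lines above become:

```cpp
#include <algorithm>
#include <vector>

// sort + unique on the outer vector: std::vector provides lexicographic
// operator< and element-wise operator==, so no custom comparator is needed.
void remove_duplicates(std::vector<std::vector<unsigned int>>& combs)
{
    std::sort(combs.begin(), combs.end());
    combs.erase(std::unique(combs.begin(), combs.end()), combs.end());
}
```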

Upvotes: 9
