cppcoder
cppcoder

Reputation: 1204

Getting unique elements from a container [c++]

I was looking to get only the unique elements from a container. Let's say srcContainer is the container from which I want unique elements. I looked at three options:

  1. Using std::unique

       std::sort(srcContainer.begin(), srcContainer.end());
       srcContainer.erase(std::unique(srcContainer.begin(), srcContainer.end()), srcContainer.end());
    
  2. Using BOOST::unique

    boost::erase(srcContainer, boost::unique<boost::return_found_end>(boost::sort(srcContainer)));  
    
  3. My own method

    std::set<T> uniqueElems(srcContainer.begin(), srcContainer.end());  
    srcContainer.clear();  
    srcContainer.insert(srcContainer.end(), uniqueElems.begin(), uniqueElems.end()); 
    

The issue with 1. and 2. are that they change the order in which members occurred in the original srcContainer. With 3. there is no change in order, and in addition it gives a much better performance compared to 1. and 2 (Is it because there is no explicit sorting in 3. ??) above. The elapsed wall clock time for 3 methods above and the number of elements in srcContainer are given below:

  1. size of srcContainer (contains integers) = 1e+6
    - std::unique = 1.04779 secs
    - BOOST::unique = 1.04774 secs
    - Own method = 0.481638 secs

  2. size of srcContainer (contains integers) = 1e+8
    - std::unique = 151.554 secs
    - BOOST::unique = 151.474 secs
    - Own method = 57.5693 secs

My question is:

  1. Is there better way to find unique using std::unique or BOOST::unique or any other code and maintaining the original order in the container?
  2. Any issue with using method 3. above.

For performance profiling srcContainer was created as follows:

std::vector<int> srcContainer;  
int halfWay = numElems/2;  
for (size_t k=0; k<numElems; ++k) {  
   if (k < halfWay)  
      srcContainer.push_back(k);  
   else  
      srcContainer.push_back(k - halfWay);  
}  

Edits:
Agreed with comments that method 3. also changes the order of elements. Is there a better way to get unique elements without changing order?

Thanks

Upvotes: 4

Views: 1401

Answers (1)

Mark B
Mark B

Reputation: 96241

EDIT based on info about source data: The reason you're seeing the set insertion complete faster than sorting the vector is that your input data is two already sorted ranges. For quicksort (typically used by std::sort) this is a degenerate case and one of the worst possible inputs you could give it. For an input size of 1e8 changing the sort from std::sort to std::stable_sort cut the runtime from ~25s to <9s.

If you want to keep the original item order, you could try something like the following which keeps a hash of all the items. I have no idea what the performance of this would be, but for example you could utilize an approach with hashing and remove_if as sketched below:

struct Remover
{
    explicit Remover(hash& found_items) : found_items_(found_items) { }
    bool operator()(const Iter& item) { retval = <does exist in hash>; add to hash; return retval; }

    hash& found_items_;
};

hash dup_finder;
Remover remover(dup_finder);
std::erase(std::remove_if(src.begin(), src.end(), remover), src.end());

Original components of my answer:

If the elements in the source container are already mostly sorted, you may see better performance with stable_sort rather than sort prior to calling unique. I'm unable to guess without more information about yoru data set what might cause option 3 to perform better than 1&2.

Option 3 should remove uniques but keep in mind that in spite of what you assert, it will still reorder the items in exactly the same way that the first two options do.

Upvotes: 1

Related Questions