Reputation: 43209
I have an array of integers that I need to remove duplicates from while maintaining the order of the first occurrence of each integer. I can see doing it like this, but I imagine there is a better way that makes better use of STL algorithms. The insertion is out of my control, so I cannot check for duplicates before inserting.
#include <set>
#include <vector>

int unsortedRemoveDuplicates(std::vector<int> &numbers) {
    std::set<int> uniqueNumbers;
    std::vector<int>::iterator allItr = numbers.begin();
    std::vector<int>::iterator unique = allItr;
    std::vector<int>::iterator endItr = numbers.end();

    for (; allItr != endItr; ++allItr) {
        // insert() reports whether the value was new
        const bool isUnique = uniqueNumbers.insert(*allItr).second;
        if (isUnique) {
            *unique = *allItr;   // compact first occurrences toward the front
            ++unique;
        }
    }

    const int duplicates = endItr - unique;
    numbers.erase(unique, endItr);
    return duplicates;
}
How can this be done using STL algorithms?
Upvotes: 31
Views: 33642
Reputation: 210755
The naive way is to use std::set as everyone tells you. It's overkill and has poor cache locality (slow).
The smart* way is to use std::vector appropriately (make sure to see footnote at bottom):
#include <algorithm>
#include <iterator>
#include <vector>

struct target_less
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a < *b; }
};
struct target_equal
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a == *b; }
};
template<class It> It uniquify(It begin, It const end)
{
    std::vector<It> v;
    v.reserve(static_cast<size_t>(std::distance(begin, end)));
    for (It i = begin; i != end; ++i)
    { v.push_back(i); }
    // Group equal values; the stable sort keeps the first occurrence first in each group.
    std::stable_sort(v.begin(), v.end(), target_less());
    // Keep one iterator (the first occurrence) per group.
    v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
    // Put the surviving iterators back into positional order.
    std::sort(v.begin(), v.end());
    size_t j = 0;
    for (It i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            // Move the kept element to the front of the range.
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    return begin;
}
Then you can use it like:
int main()
{
    std::vector<int> v;
    // ... fill v ...
    v.erase(uniquify(v.begin(), v.end()), v.end());
}
*Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.
A benchmark showing that this is indeed faster was added in (based on this answer's uniquify()
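The same sort, unique, sort-back idea can be sketched with indices instead of iterators, which may be easier to follow. This is only an illustration for std::vector<int>; uniqueKeepOrder is a hypothetical name and, unlike uniquify(), it returns a copy rather than working in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the first occurrence of each value, in original order.
std::vector<int> uniqueKeepOrder(const std::vector<int>& in)
{
    std::vector<std::size_t> idx(in.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});   // 0, 1, 2, ...

    // Group equal values; stable_sort keeps the smallest index first in each group.
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) { return in[a] < in[b]; });

    // Keep one index (the first occurrence) per group of equal values.
    idx.erase(std::unique(idx.begin(), idx.end(),
                          [&](std::size_t a, std::size_t b) { return in[a] == in[b]; }),
              idx.end());

    // Restore the original relative order of the survivors.
    std::sort(idx.begin(), idx.end());

    std::vector<int> out;
    out.reserve(idx.size());
    for (std::size_t i : idx) out.push_back(in[i]);
    return out;
}
```

For the input 6 5 5 8 5 8 the sorted index groups are (1, 2, 4), (0), (3, 5), so the surviving indices are 0, 1, 3 and the result is 6 5 8.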
Upvotes: 19
Reputation: 316
In my software I always declare a huge global or static array initialized to (int)-1, t0, to speed up this kind of operation:
#include <vector>
#include <stdlib.h>
#include <algorithm>
#include <iostream>
static std::vector<int> t0s (1000000000, -1);
int main (int argc, char* argv []) {
  const int L (10), N (5);
  std::vector<int> vec (L, 0);
  std::for_each (vec.begin (), vec.end (), [&N] (auto& s) {s = rand () %N;});
  std::cout << "\tvec == "; std::for_each (vec.begin (), vec.end (), [] (const auto& s) {std::cout << s << " ";}); std::cout << std::endl;
  auto& t0 (t0s);
  const auto zero ((const int) 0), sm1 ((const int) -1);
  auto i (vec.begin ());
  // first time a value is seen: keep it and mark its slot in t0
  std::for_each (vec.begin (), vec.end (), [&t0, &i, &sm1, &zero] (const int& s) {
    if (t0 [s] == sm1) t0 [(*i++ = s)] = zero;
  });
  if (i != vec.end ()) vec.erase (i, vec.end ());
  //leaving t0 clean
  std::for_each (vec.begin (), i, [&t0, &sm1] (const auto& s) {t0 [s] = sm1;});
  std::cout << "\tvec same order but unique == "; std::for_each (vec.begin (), vec.end (), [] (const auto& s) {std::cout << s << " ";});
  return 0;
}
Normal output :
vec == 3 1 2 0 3 0 1 2 4 1
vec same order but unique == 3 1 2 0 4
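When the values are known to lie in a small range, the same lookup-table idea can be sketched with a per-call flag table instead of a one-billion-entry global. This is an assumption-laden illustration: dedupBounded and its limit parameter are hypothetical, and every value must satisfy 0 <= value < limit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper; assumes every value v in vec satisfies 0 <= v < limit.
void dedupBounded(std::vector<int>& vec, std::size_t limit)
{
    std::vector<char> seen(limit, 0);          // one flag per possible value
    auto out = vec.begin();                    // next slot for a kept element
    for (int value : vec) {
        char& flag = seen[static_cast<std::size_t>(value)];
        if (!flag) {                           // first occurrence: keep it
            flag = 1;
            *out++ = value;
        }
    }
    vec.erase(out, vec.end());                 // drop the leftover tail
}
```

Calling dedupBounded(vec, 5) on the sample vector above (3 1 2 0 3 0 1 2 4 1) leaves 3 1 2 0 4, matching the output shown.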
Upvotes: 0
Reputation: 1870
Here is a C++11 generic version that works with iterators and doesn't allocate additional storage. It has the disadvantage of being O(n^2), but is likely faster for small input sizes.
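#include <algorithm>
#include <iterator>
template<typename Iter>
Iter removeDuplicates(Iter begin,Iter end)
{
    auto it = begin;
    while(it != end)
    {
        auto next = std::next(it);
        if(next == end)
            break;
        end = std::remove(next,end,*it);   // drop later copies of *it, shrinking the range
        it = next;
    }
    return end;   // new logical end; the caller erases the tail
}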
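A minimal usage sketch, assuming a std::vector<int>: because removeDuplicates only returns the new logical end, the caller still has to erase the tail. The function is repeated here so the snippet compiles on its own; dedupQuadratic is a hypothetical wrapper name.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Same function as in the answer, repeated so this snippet is self-contained.
template <typename Iter>
Iter removeDuplicates(Iter begin, Iter end)
{
    auto it = begin;
    while (it != end) {
        auto next = std::next(it);
        if (next == end)
            break;
        end = std::remove(next, end, *it);   // remove later copies of *it
        it = next;
    }
    return end;
}

// Hypothetical wrapper: applies the erase half of the erase-remove idiom.
void dedupQuadratic(std::vector<int>& v)
{
    v.erase(removeDuplicates(v.begin(), v.end()), v.end());
}
```

For 3 1 3 2 1 the first pass removes the second 3, the second pass removes the second 1, and the vector ends up as 3 1 2.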
Upvotes: -1
Reputation: 1259
Fast and simple, C++11:
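#include <algorithm>
#include <set>
#include <vector>
template<typename T>
size_t RemoveDuplicatesKeepOrder(std::vector<T>& vec)
{
    std::set<T> seen;
    auto newEnd = std::remove_if(vec.begin(), vec.end(), [&seen](const T& value)
    {
        if (seen.find(value) != std::end(seen))
            return true;        // already seen: remove it
        seen.insert(value);     // first occurrence: remember and keep it
        return false;
    });
    vec.erase(newEnd, vec.end());
    return vec.size();
}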
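A possible variation on the snippet above, sketched under the assumption that T is hashable (std::hash<T> and operator== exist): swapping std::set for std::unordered_set and using the result of insert() directly avoids the separate find(). The name removeDuplicatesHashed is hypothetical, and whether this is actually faster would need measuring. Like the version above, it relies on the predicate seeing each element once, front to back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical variant; requires std::hash<T> and operator== for T.
template <typename T>
std::size_t removeDuplicatesHashed(std::vector<T>& vec)
{
    std::unordered_set<T> seen;
    seen.reserve(vec.size());                  // avoid rehashing while filling
    // insert() returns {iterator, inserted}; nothing inserted means duplicate.
    auto newEnd = std::remove_if(vec.begin(), vec.end(),
                                 [&seen](const T& value) {
                                     return !seen.insert(value).second;
                                 });
    vec.erase(newEnd, vec.end());
    return vec.size();                         // number of unique elements left
}
```

Applied to 3 6 3 8 9 5 6 8, this leaves 3 6 8 9 5 and returns 5.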
Upvotes: 15
Reputation: 35921
To verify the performance of the proposed solutions, I've tested three of them, listed below. The tests use random vectors with 1 million elements and different ratios of duplicates (0%, 1%, 2%, ..., 10%, ..., 90%, 100%).
Mehrdad's solution, currently the accepted answer:
void uniquifyWithOrder_sort(const vector<int>&, vector<int>& output)
{
    using It = vector<int>::iterator;
    struct target_less
    {
        bool operator()(It const &a, It const &b) const { return *a < *b; }
    };
    struct target_equal
    {
        bool operator()(It const &a, It const &b) const { return *a == *b; }
    };
    auto begin = output.begin();
    auto const end = output.end();
    vector<It> v;
    v.reserve(static_cast<size_t>(distance(begin, end)));
    for (auto i = begin; i != end; ++i)
    {
        v.push_back(i);
    }
    sort(v.begin(), v.end(), target_less());
    v.erase(unique(v.begin(), v.end(), target_equal()), v.end());
    sort(v.begin(), v.end());
    size_t j = 0;
    for (auto i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    output.erase(begin, output.end());
}
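The std::copy_if solution with a std::set predicate:
void uniquifyWithOrder_set_copy_if(const vector<int>& input, vector<int>& output)
{
    struct NotADuplicate
    {
        bool operator()(const int& element) { return _s.insert(element).second; }
        set<int> _s;
    };
    vector<int> uniqueNumbers;
    NotADuplicate pred;
    output.clear();
    // ref() keeps the algorithm from copying the stateful predicate
    copy_if(input.begin(), input.end(), back_inserter(output), ref(pred));
}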
The std::remove_if solution with std::set:
void uniquifyWithOrder_set_remove_if(const vector<int>& input, vector<int>& output)
{
    set<int> seen;
    auto newEnd = remove_if(output.begin(), output.end(), [&seen](const int& value)
    {
        if (seen.find(value) != end(seen))
            return true;
        seen.insert(value);
        return false;
    });
    output.erase(newEnd, output.end());
}
They are slightly modified for simplicity, and to allow comparing in-place solutions with not in-place ones. The full code used to test is available here.
The results suggest that if you know you'll have at least 1% duplicates, the remove_if solution with std::set is the best one. Otherwise, you should go with the sort solution:
// Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz 3.40 GHz
// 16 GB RAM, Windows 7, 64 bit
// cl 19
// /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /Zc:forScope /Gd /Oi /MD /EHsc /nologo /Ot
// 1000 random vectors with 1 000 000 elements each.
// 11 tests: with 0%, 10%, 20%, ..., 90%, 100% duplicates in vectors.
// Ratio: 0
// set_copy_if : Time : 618.162 ms +- 18.7261 ms
// set_remove_if : Time : 650.453 ms +- 10.0107 ms
// sort : Time : 212.366 ms +- 5.27977 ms
// Ratio : 0.1
// set_copy_if : Time : 34.1907 ms +- 1.51335 ms
// set_remove_if : Time : 24.2709 ms +- 0.517165 ms
// sort : Time : 43.735 ms +- 1.44966 ms
// Ratio : 0.2
// set_copy_if : Time : 29.5399 ms +- 1.32403 ms
// set_remove_if : Time : 20.4138 ms +- 0.759438 ms
// sort : Time : 36.4204 ms +- 1.60568 ms
// Ratio : 0.3
// set_copy_if : Time : 32.0227 ms +- 1.25661 ms
// set_remove_if : Time : 22.3386 ms +- 0.950855 ms
// sort : Time : 38.1551 ms +- 1.12852 ms
// Ratio : 0.4
// set_copy_if : Time : 30.2714 ms +- 1.28494 ms
// set_remove_if : Time : 20.8338 ms +- 1.06292 ms
// sort : Time : 35.282 ms +- 2.12884 ms
// Ratio : 0.5
// set_copy_if : Time : 24.3247 ms +- 1.21664 ms
// set_remove_if : Time : 16.1621 ms +- 1.27802 ms
// sort : Time : 27.3166 ms +- 2.12964 ms
// Ratio : 0.6
// set_copy_if : Time : 27.3268 ms +- 1.06058 ms
// set_remove_if : Time : 18.4379 ms +- 1.1438 ms
// sort : Time : 30.6846 ms +- 2.52412 ms
// Ratio : 0.7
// set_copy_if : Time : 30.3871 ms +- 0.887492 ms
// set_remove_if : Time : 20.6315 ms +- 0.899802 ms
// sort : Time : 33.7643 ms +- 2.2336 ms
// Ratio : 0.8
// set_copy_if : Time : 33.3077 ms +- 0.746272 ms
// set_remove_if : Time : 22.9459 ms +- 0.921515 ms
// sort : Time : 37.119 ms +- 2.20924 ms
// Ratio : 0.9
// set_copy_if : Time : 36.0888 ms +- 0.763978 ms
// set_remove_if : Time : 24.7002 ms +- 0.465711 ms
// sort : Time : 40.8233 ms +- 2.59826 ms
// Ratio : 1
// set_copy_if : Time : 21.5609 ms +- 1.48986 ms
// set_remove_if : Time : 14.2934 ms +- 0.535431 ms
// sort : Time : 24.2485 ms +- 0.710269 ms
// Ratio: 0
// set_copy_if : Time: 666.962 ms +- 23.7445 ms
// set_remove_if : Time: 736.088 ms +- 39.8122 ms
// sort : Time: 223.796 ms +- 5.27345 ms
// Ratio: 0.01
// set_copy_if : Time: 60.4075 ms +- 3.4673 ms
// set_remove_if : Time: 43.3095 ms +- 1.31252 ms
// sort : Time: 70.7511 ms +- 2.27826 ms
// Ratio: 0.02
// set_copy_if : Time: 50.2605 ms +- 2.70371 ms
// set_remove_if : Time: 36.2877 ms +- 1.14266 ms
// sort : Time: 62.9786 ms +- 2.69163 ms
// Ratio: 0.03
// set_copy_if : Time: 46.9797 ms +- 2.43009 ms
// set_remove_if : Time: 34.0161 ms +- 0.839472 ms
// sort : Time: 59.5666 ms +- 1.34078 ms
// Ratio: 0.04
// set_copy_if : Time: 44.3423 ms +- 2.271 ms
// set_remove_if : Time: 32.2404 ms +- 1.02162 ms
// sort : Time: 57.0583 ms +- 2.9226 ms
// Ratio: 0.05
// set_copy_if : Time: 41.758 ms +- 2.57589 ms
// set_remove_if : Time: 29.9927 ms +- 0.935529 ms
// sort : Time: 54.1474 ms +- 1.63311 ms
// Ratio: 0.06
// set_copy_if : Time: 40.289 ms +- 1.85715 ms
// set_remove_if : Time: 29.2604 ms +- 0.593869 ms
// sort : Time: 57.5436 ms +- 5.52807 ms
// Ratio: 0.07
// set_copy_if : Time: 40.5035 ms +- 1.80952 ms
// set_remove_if : Time: 29.1187 ms +- 0.63127 ms
// sort : Time: 53.622 ms +- 1.91357 ms
// Ratio: 0.08
// set_copy_if : Time: 38.8139 ms +- 1.9811 ms
// set_remove_if : Time: 27.9989 ms +- 0.600543 ms
// sort : Time: 50.5743 ms +- 1.35296 ms
// Ratio: 0.09
// set_copy_if : Time: 39.0751 ms +- 1.71393 ms
// set_remove_if : Time: 28.2332 ms +- 0.607895 ms
// sort : Time: 51.2829 ms +- 1.21077 ms
// Ratio: 0.1
// set_copy_if : Time: 35.6847 ms +- 1.81495 ms
// set_remove_if : Time: 25.204 ms +- 0.538245 ms
// sort : Time: 46.4127 ms +- 2.66714 ms
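The test setup described above (random vectors with a chosen duplicate ratio, each solution timed on them) could be approximated along these lines. This is a guess at the setup, not the linked test code: makeInput, elapsedMs, and the way a "duplicate ratio" is turned into data are all assumptions, and duplicateRatio is expected to lie in [0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical generator: n elements holding round(n * (1 - duplicateRatio))
// distinct values (at least one); the remaining elements repeat one of them.
std::vector<int> makeInput(std::size_t n, double duplicateRatio, unsigned seed)
{
    if (n == 0) return {};
    std::size_t distinct =
        static_cast<std::size_t>(std::llround(n * (1.0 - duplicateRatio)));
    distinct = std::max<std::size_t>(distinct, 1);

    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, static_cast<int>(distinct) - 1);

    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = i < distinct ? static_cast<int>(i) : pick(rng);   // distinct first, then repeats
    std::shuffle(v.begin(), v.end(), rng);                       // scatter the repeats
    return v;
}

// Hypothetical timer: runs f once and returns the elapsed wall time in milliseconds.
template <typename F>
double elapsedMs(F&& f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A run would then copy the generated vector for each solution and wrap the call in elapsedMs, repeating over many vectors to get a mean and spread like the figures above.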
Upvotes: 5
Reputation: 2036
Here's something that handles POD and non-POD types with move support. Uses default operator== or a custom equality predicate. Does not require sorting/operator<, key generation, or a separate set. No idea if this is more efficient than the other methods described above.
#include <algorithm>
#include <functional>
#include <iterator>
#include <utility>
template <typename Cnt, typename _Pr = std::equal_to<typename Cnt::value_type>>
void remove_duplicates( Cnt& cnt, _Pr cmp = _Pr() )
{
    Cnt result;
    result.reserve( std::size( cnt ) ); // or cnt.size() if compiler doesn't support std::size()
    std::copy_if(
        std::make_move_iterator( std::begin( cnt ) )
        , std::make_move_iterator( std::end( cnt ) )
        , std::back_inserter( result )
        , [&]( const typename Cnt::value_type& what )
        {
            // keep 'what' only if nothing equal is already in result
            return std::find_if(
                std::begin( result )
                , std::end( result )
                , [&]( const typename Cnt::value_type& existing ) { return cmp( what, existing ); }
            ) == std::end( result );
        }
    ); // copy_if
    cnt = std::move( result ); // place result in cnt param
} // remove_duplicates
std::vector<int> ints{ 0,1,1,2,3,4 };
remove_duplicates( ints );
assert( ints.size() == 5 );

struct data
{
    std::string foo;
    bool operator==( const data& rhs ) const { return this->foo == rhs.foo; }
};
std::vector<data>
    mydata{ { "hello" }, {"hello"}, {"world"} }
    , mydata2 = mydata
    ;
// use operator==
remove_duplicates( mydata );
assert( mydata.size() == 2 );
// use custom predicate
remove_duplicates( mydata2, []( const data& left, const data& right ) { return left.foo == right.foo; } );
assert( mydata2.size() == 2 );
Upvotes: 0
Reputation: 866
Here is what WilliamKF is searching for. It uses the erase statement. This code is good for lists but isn't good for vectors, because erasing from the middle of a vector shifts every later element. For vectors you should not use the erase statement.
//makes uniques in one shot without sorting !!
template<class listtype> inline
void uniques(listtype* In)
{
    for (typename listtype::iterator it = In->begin(); it != In->end(); ++it)
    {
        typename listtype::iterator it2 = it;
        ++it2;
        while (it2 != In->end())
        {
            if ((*it)==(*it2)) it2 = In->erase(it2);   // drop a later duplicate of *it
            else ++it2;
        }
    }
}
What I have tried for vectors, without using sort, is this:
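//makes vectors as fast as possible unique
template<typename T> inline
void vectoruniques(std::vector<T>* In)
{
    if (In->size() < 2) return;   // nothing to do; also avoids end()-1 on an empty vector
    for (typename std::vector<T>::iterator it = In->begin();it<In->end()-1;it++)
    {
        T tmp = *it;
        for (typename std::vector<T>::iterator it2 = it+1;it2<In->end();it2++)
        {
            if (*it2!=*it)
                tmp = *it2;    // remember the last value that differs from *it
            else
                *it2 = tmp;    // overwrite the duplicate so it equals its neighbour
        }
    }
    // duplicates are now adjacent to an equal value, so std::unique can drop them
    typename std::vector<T>::iterator it = std::unique(In->begin(),In->end());
    int newsize = std::distance(In->begin(),it);
    In->resize(newsize);
}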
Somehow it looks like this would work. I tested it a bit; maybe somebody can tell me if this really works! This solution doesn't need any greater-than operator. I mean, why use the greater-than operator to search for unique elements? Usage for vectors:
int myints[] = {21,10,20,20,20,30,21,31,20,20,2};
std::vector<int> abc(myints , myints+11);
vectoruniques(&abc);
Upvotes: 0
Reputation: 227598
Sounds like a job for std::copy_if. Define a predicate that keeps track of elements that already have been processed and return false if they have.
If you don't have C++11 support, you can use the clumsily named std::remove_copy_if and invert the logic.
This is an untested example:
template <typename T>
struct NotDuplicate {
  bool operator()(const T& element) {
    return s_.insert(element).second; // true if element was newly inserted
  }
 private:
  std::set<T> s_;
};

std::vector<int> uniqueNumbers;
NotDuplicate<int> pred;
std::copy_if(numbers.begin(), numbers.end(),
             std::back_inserter(uniqueNumbers),
             std::ref(pred));

where an std::ref has been used to avoid potential problems with the algorithm internally copying what is a stateful functor, although std::copy_if does not place any requirements on side-effects of the functor being applied.
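The pre-C++11 route mentioned above, std::remove_copy_if with the logic inverted, might look roughly like this. It is a sketch rather than tested production code: IsDuplicate and copyUnique are hypothetical names, and because std::ref is not available before C++11, the functor holds a pointer to an external set so that copies of it share state.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical predicate: true for values that have been seen before.
// It points at an external set, so copies of the functor share one state.
struct IsDuplicate {
    explicit IsDuplicate(std::set<int>* seen) : seen_(seen) {}
    bool operator()(int value) const { return !seen_->insert(value).second; }
    std::set<int>* seen_;
};

// Hypothetical helper: copies the first occurrence of each value, in order.
std::vector<int> copyUnique(const std::vector<int>& numbers)
{
    std::set<int> seen;
    std::vector<int> uniqueNumbers;
    // remove_copy_if copies every element for which the predicate is false.
    std::remove_copy_if(numbers.begin(), numbers.end(),
                        std::back_inserter(uniqueNumbers),
                        IsDuplicate(&seen));
    return uniqueNumbers;
}
```

The predicate is the negation of NotDuplicate: it returns true for elements to skip, which is what remove_copy_if expects.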
Upvotes: 19
Reputation: 1319
int unsortedRemoveDuplicates(std::vector<int>& numbers)
{
    std::set<int> seenNums; //log(n) existence check
    auto itr = begin(numbers);
    while(itr != end(numbers))
    {
        if(seenNums.find(*itr) != end(seenNums)) //seen? erase it
            itr = numbers.erase(itr); //itr now points to next element
        else
        {   //first occurrence: remember it and move on
            seenNums.insert(*itr);
            ++itr;
        }
    }
    return seenNums.size();
}
//3 6 3 8 9 5 6 8
//3 6 8 9 5
Upvotes: 9