Reputation: 3172
I have 4e7 std::strings, each about 3 to 30 characters long, with many duplicates. I'm putting them in a std::set.

Calling set::insert for each string becomes intractably slow well before it would complete with about 1e7 unique strings. So instead I push_back each string into a vector, sort() and unique() that, and then move the strings into a set.
It's still slow, but at least it finishes: 4 seconds to accumulate the vector, 30 more for sort(), 3 more for unique().
The bottleneck is sort(). But I don't need the strings to be lexicographically sorted! I just need duplicate strings to be contiguous, for unique(). Their order is irrelevant.

Is there a simpler, faster string comparison function for sort() that I could use instead of its default one?
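For illustration, a sketch of what such a comparator might look like (my own example, not from the original post): comparing lengths before content short-circuits most comparisons, and equal strings still end up adjacent, which is all unique() needs. Whether it actually beats the default ordering depends on the data.

sort(v.begin(), v.end(),
     [](const string& a, const string& b) -> bool {
         if (a.size() != b.size()) return a.size() < b.size();
         return a < b;  // any consistent tie-break keeps duplicates adjacent
     });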
Or should I build the set faster by iterating over the vector with a hash table on the side, to skip duplicates? Or replace set with hash_set or unordered_set?
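A minimal sketch of the hash-table-on-the-side idea, where my_string[i] is a hypothetical placeholder for the input source:

unordered_set<string> seen;   // side table used only for membership tests
vector<string> uniques;       // keeps the first copy of each distinct string
loop {
    if (seen.insert(my_string[i]).second)  // .second is true for a new string
        uniques.push_back(my_string[i]);
}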
Edit: I'm building on Linux with g++ 4.8.4, with the only flags being -std=c++11 -O3.
Upvotes: 1
Views: 378
Reputation: 3172
@Someprogrammerdude, @J.AntonioPerez, @KennyOstrom: std::unordered_set is 6x faster. Post an answer and I'll accept it. (This offer may have been lost in all those comments.)
vector<string> v;
loop { v.push_back(my_string[i]); }  // accumulate all 4e7 strings
Slow original:
sort(v.begin(), v.end());
v.erase(unique(v.begin(), v.end()), v.end());
set<string> noduplicates = set<string>(
    make_move_iterator(v.begin()), make_move_iterator(v.end()));
6x faster than the preceding code block:
unordered_set<string> noduplicates = unordered_set<string>(
    make_move_iterator(v.begin()), make_move_iterator(v.end()));
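The win comes from skipping the O(n log n) sort entirely: unordered_set hashes each string once and compares only on hash-bucket collisions. If the unique-string count is roughly known up front (the question estimates about 1e7), reserving buckets before a bulk insert may also avoid rehashes; a minimal sketch of that variant:

unordered_set<string> noduplicates;
noduplicates.reserve(10000000);  // ~1e7 unique strings expected, per the question
noduplicates.insert(make_move_iterator(v.begin()),
                    make_move_iterator(v.end()));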
Upvotes: 2