Reputation: 2439
For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.
Here's the simplified pseudocode:
set = {}  # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)
Here's the simplified C++ for it (roughly):
std::unordered_set<std::string> set;
for (auto &string : strings) {
    // do some non-trivial work here that is difficult to parallelize
    auto result = set.emplace(string);  // a no-op if the string is already present
}
// afterwards, iterate over set and dump strings into vector
However, this is not fast enough for my needs (I've benchmarked it carefully). I've tried a few ideas to make it faster (e.g. doing the string comparisons directly with strcmp), but all of the solutions I've found are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.
Upvotes: 2
Views: 109
Reputation: 44258
You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually:
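A minimal sketch of what such a hand-rolled, fixed-bucket set might look like (SimpleStringSet and kBucketCount are placeholder names, and the bucket count is only a starting guess):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Number of buckets; tune experimentally (see the note below).
constexpr std::size_t kBucketCount = 1 << 16;

struct SimpleStringSet {
    std::vector<std::vector<std::string>> buckets;

    SimpleStringSet() : buckets(kBucketCount) {}

    // Returns true if the string was newly inserted, false if it was a duplicate.
    bool insert(const std::string &s) {
        auto &bucket = buckets[std::hash<std::string>{}(s) % kBucketCount];
        for (const auto &existing : bucket)
            if (existing == s)
                return false;
        bucket.push_back(s);
        return true;
    }
};

Because every string maps to exactly one bucket, the input can be pre-partitioned by hash value and each partition handled by its own thread without any locking.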
You may need to experiment with the bucket count and measure how it affects performance. Logically it should be neither too large nor too small, to prevent congestion.
By the way, from your description it sounds like you load all the strings into memory first and then eliminate duplicates. You could instead read your data directly into the std::unordered_set; that would save memory and likely increase speed as well.
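A minimal sketch of that, assuming the strings arrive one per line on std::cin (adapt the reading loop to your real input source):

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::unordered_set<std::string> set;
    std::string line;
    // Insert each string as it is read; the set silently discards duplicates.
    while (std::getline(std::cin, line))
        set.insert(std::move(line));
    // Dump the unique strings into a vector only if downstream code needs one.
    std::vector<std::string> deduped(set.begin(), set.end());
}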
Upvotes: 0
Reputation: 419
You can try various algorithms and data structures to solve your problem. Unfortunately, there is no general approach: to a large extent, the choice depends on the nature of the data being processed. The most promising direction, in my view, is to reduce the computations so that they work with a smaller data set.
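One concrete illustration of the smaller-data-set idea, sketched under the assumption that the input is an in-memory std::vector<std::string>: deduplicate on precomputed 64-bit hash values first, and only fall back to comparing full strings when two inputs happen to share a hash, so that most strings are never compared character by character.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> dedupe(const std::vector<std::string> &strings) {
    // Map each hash to the indices (into `deduped`) of strings already seen with that hash.
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> seen;
    std::vector<std::string> deduped;
    for (const auto &s : strings) {
        const std::uint64_t h = std::hash<std::string>{}(s);
        auto &candidates = seen[h];
        bool duplicate = false;
        for (std::size_t idx : candidates)
            if (deduped[idx] == s) {  // full comparison only on a hash match
                duplicate = true;
                break;
            }
        if (!duplicate) {
            candidates.push_back(deduped.size());
            deduped.push_back(s);
        }
    }
    return deduped;
}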
Upvotes: 1