meisel

Reputation: 2439

What are some efficient ways to de-dupe a set of > 1 million strings?

For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.

Here's the simplified pseudocode:

set = # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)

Here's the simplified C++ for it (roughly):

#include <string>
#include <unordered_set>
#include <vector>

std::unordered_set<std::string> set;
for (auto &string : strings) {
  // do some non-trivial work here that is difficult to parallelize
  set.insert(string);  // insert() is a no-op if the string is already present
}
// afterwards, iterate over the set and dump the strings into a vector

However, this is not fast enough for my needs (I've benchmarked it carefully). I've considered several ideas for making it faster, but all of them, I've found, are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.

Upvotes: 2

Views: 109

Answers (2)

Slava

Reputation: 44258

You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually (a sketch follows below):

  1. Create an arbitrary number of buckets (probably proportional or equal to the number of threads in your thread pool).
  2. Using the thread pool, calculate the hashes of your strings in parallel and distribute the strings, together with their hashes, between the buckets. You may need to lock individual buckets when adding strings to them, but that operation should be short, and/or you can use a lock-free structure.
  3. Process each bucket individually using your thread pool: compare hashes and, if they are equal, compare the strings themselves.

You may need to experiment with the number and size of the buckets and check how that affects performance. Logically the buckets should be neither too big nor too small, to avoid contention.
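A minimal sketch of this idea using plain std::thread (no real thread pool, and the hash/split phase kept single-threaded for brevity; the names dedupe_parallel and num_buckets are purely illustrative):

#include <cstddef>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedupe_parallel(const std::vector<std::string> &strings,
                                         std::size_t num_buckets) {
  // Phase 1: split the strings by hash into buckets. Done single-threaded
  // here for brevity; the answer suggests doing this in parallel with
  // per-bucket locks or a lock-free structure.
  std::vector<std::vector<std::string>> buckets(num_buckets);
  std::hash<std::string> hasher;
  for (const auto &s : strings)
    buckets[hasher(s) % num_buckets].push_back(s);

  // Phase 2: de-dupe each bucket on its own thread. Equal strings always
  // hash to the same bucket, so buckets never need to compare with each other.
  std::vector<std::unordered_set<std::string>> sets(num_buckets);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_buckets; ++i)
    workers.emplace_back([&sets, &buckets, i] {
      sets[i].insert(buckets[i].begin(), buckets[i].end());
    });
  for (auto &t : workers)
    t.join();

  // Phase 3: concatenate the per-bucket results.
  std::vector<std::string> deduped;
  for (const auto &s : sets)
    deduped.insert(deduped.end(), s.begin(), s.end());
  return deduped;
}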

By the way, from your description it sounds like you load all the strings into memory and then eliminate duplicates. You could instead read your data directly into a std::unordered_set; that way you save memory and may gain some speed as well.
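For example, a minimal sketch of that, assuming the input arrives one string per line on a std::istream (the name read_deduped is purely illustrative):

#include <istream>
#include <string>
#include <unordered_set>

std::unordered_set<std::string> read_deduped(std::istream &in) {
  std::unordered_set<std::string> set;
  std::string line;
  while (std::getline(in, line))
    set.insert(line);  // insert() silently ignores duplicates
  return set;
}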

Upvotes: 0

Grandbrain

Reputation: 419

You can try various algorithms and data structures to solve your problem:

  1. Try a prefix tree (trie), a suffix automaton, or a hash table. A hash table is one of the fastest ways to find duplicates; try different hash table implementations.
  2. Use properties of the data to avoid unnecessary work. For example, only strings of the same length can be duplicates of each other, so you can process each length class separately (see the sketch after this list).
  3. Try a "divide and conquer" approach to parallelize the computation. For example, divide the set of strings into a number of subsets equal to the number of hardware threads, de-dupe each subset, and then combine the subsets into one. Since the subsets shrink in the process (if the number of duplicates is large enough), combining them should not be too expensive.
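A minimal sketch of the length-partitioning idea from item 2, assuming the input is a std::vector<std::string> (the name dedupe_by_length is purely illustrative):

#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedupe_by_length(const std::vector<std::string> &strings) {
  // Strings of different lengths can never be equal, so each length class
  // can be de-duped independently (and, if desired, in parallel).
  std::map<std::size_t, std::unordered_set<std::string>> by_length;
  for (const auto &s : strings)
    by_length[s.size()].insert(s);

  // Concatenate the per-length results.
  std::vector<std::string> deduped;
  for (const auto &entry : by_length)
    for (const auto &s : entry.second)
      deduped.push_back(s);
  return deduped;
}

Whether this actually helps depends on how the lengths are distributed; if most strings share the same length, the split buys little.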

Unfortunately, there is no general approach to this problem; to a large extent, the right choice depends on the nature of the data being processed. The second item on my list seems the most promising to me. Always try to reduce the computation to a smaller data set.

Upvotes: 1
