meisel

Reputation: 2439

What are some efficient ways to de-dupe a set of > 1 million strings?

For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.

Here's the simplified pseudocode:

set = # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)

Here's the simplified C++ for it (roughly):

#include <string>
#include <unordered_set>
#include <vector>

std::unordered_set<std::string> set;
for (auto &string : strings) {
  // do some non-trivial work here that is difficult to parallelize
  set.insert(string);  // insert() is a no-op if the string is already present
}
// afterwards, iterate over the set and dump the strings into a vector

However, this is not fast enough for my needs (I've benchmarked it carefully). I've considered several ideas for making it faster, but all of them, I've found, are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.

Upvotes: 2

Views: 109

Answers (2)

Slava

Reputation: 44258

You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually (a sketch follows below):

  1. Create an arbitrary number of buckets (probably proportional or equal to the number of threads in your thread pool).
  2. Using the thread pool, calculate the hashes of your strings in parallel and distribute the strings, together with their hashes, between the buckets. You may need to lock individual buckets when adding strings to them, but that operation should be short, and/or you can use a lock-free structure.
  3. Process each bucket individually using your thread pool: compare hashes and, if they are equal, compare the strings themselves.

You may need to experiment with the number and size of the buckets and check how that affects performance. Logically the buckets should be neither too big nor too small, to avoid contention.
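A minimal sketch of this idea using plain std::thread (no real thread pool, and the hash/split phase kept single-threaded for brevity; the names dedupe_parallel and num_buckets are purely illustrative):

#include <cstddef>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedupe_parallel(const std::vector<std::string> &strings,
                                         std::size_t num_buckets) {
  // Phase 1: split the strings by hash into buckets. Done single-threaded
  // here for brevity; the answer suggests doing this in parallel with
  // per-bucket locks or a lock-free structure.
  std::vector<std::vector<std::string>> buckets(num_buckets);
  std::hash<std::string> hasher;
  for (const auto &s : strings)
    buckets[hasher(s) % num_buckets].push_back(s);

  // Phase 2: de-dupe each bucket on its own thread. Equal strings always
  // hash to the same bucket, so buckets never need to compare with each other.
  std::vector<std::unordered_set<std::string>> sets(num_buckets);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_buckets; ++i)
    workers.emplace_back([&sets, &buckets, i] {
      sets[i].insert(buckets[i].begin(), buckets[i].end());
    });
  for (auto &t : workers)
    t.join();

  // Phase 3: concatenate the per-bucket results.
  std::vector<std::string> deduped;
  for (const auto &s : sets)
    deduped.insert(deduped.end(), s.begin(), s.end());
  return deduped;
}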

By the way, from your description it sounds like you load all the strings into memory and then eliminate duplicates. You could instead read your data directly into a std::unordered_set; that way you save memory and may gain some speed as well.
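For example, a minimal sketch of that, assuming the input arrives one string per line on a std::istream (the name read_deduped is purely illustrative):

#include <istream>
#include <string>
#include <unordered_set>

std::unordered_set<std::string> read_deduped(std::istream &in) {
  std::unordered_set<std::string> set;
  std::string line;
  while (std::getline(in, line))
    set.insert(line);  // insert() silently ignores duplicates
  return set;
}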

Upvotes: 0

Grandbrain

Reputation: 419

You can try various algorithms and data structures to solve your problem:

  1. Try a prefix tree (trie), a suffix automaton, or a hash table. A hash table is one of the fastest ways to find duplicates; try different hash table implementations.
  2. Use properties of the data to avoid unnecessary work. For example, only strings of the same length can be duplicates of each other, so you can process each length class separately (see the sketch after this list).
  3. Try a "divide and conquer" approach to parallelize the computation. For example, divide the set of strings into a number of subsets equal to the number of hardware threads, de-dupe each subset, and then combine the subsets into one. Since the subsets shrink in the process (if the number of duplicates is large enough), combining them should not be too expensive.
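A minimal sketch of the length-partitioning idea from item 2, assuming the input is a std::vector<std::string> (the name dedupe_by_length is purely illustrative):

#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedupe_by_length(const std::vector<std::string> &strings) {
  // Strings of different lengths can never be equal, so each length class
  // can be de-duped independently (and, if desired, in parallel).
  std::map<std::size_t, std::unordered_set<std::string>> by_length;
  for (const auto &s : strings)
    by_length[s.size()].insert(s);

  // Concatenate the per-length results.
  std::vector<std::string> deduped;
  for (const auto &entry : by_length)
    for (const auto &s : entry.second)
      deduped.push_back(s);
  return deduped;
}

Whether this actually helps depends on how the lengths are distributed; if most strings share the same length, the split buys little.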

Unfortunately, there is no general approach to this problem; to a large extent, the right choice depends on the nature of the data being processed. The second item on my list seems the most promising to me. Always try to reduce the computation to a smaller data set.

Upvotes: 1
