Reputation: 372664
Because the trie data structure has such a huge branching factor and each subtree is totally independent of the others, it seems like there ought to be a way to enormously speed up trie construction for a given dictionary by adding all the words in parallel.
My initial idea on how to do this was the following: associate a mutex with each pointer in the trie (including the pointer to the root) and then have each thread follow the normal algorithm for inserting a word into the trie. However, before following any pointers, a thread must first acquire the lock on that pointer so that if it needs to add a new child node to the trie, it can do so without introducing any data races.
The catch with this approach is that it uses an enormous number of locks - one for each pointer in the trie - and does an enormous number of acquires and releases - one for each character in each input string.
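To make the scheme concrete, here is a minimal Python sketch of lock-guarded insertion (hypothetical names; one lock per node guarding all of that node's child pointers, which is the same idea at slightly coarser grain than one lock per pointer — and note that under CPython's GIL this illustrates the structure rather than delivering real CPU parallelism):

```python
import threading

class Node:
    def __init__(self):
        self.children = {}              # char -> Node
        self.lock = threading.Lock()    # guards this node's child pointers
        self.is_word = False

class LockingTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, word):
        node = self.root
        for ch in word:
            with node.lock:             # acquire before touching the child slot
                child = node.children.get(ch)
                if child is None:       # add a new child without a data race
                    child = Node()
                    node.children[ch] = child
            node = child
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

As the question notes, this performs one acquire/release per character of every inserted word.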
Is there a way to build a trie in parallel without using nearly as many locks?
Upvotes: 9
Views: 2541
Reputation: 452
Depending on what your dictionary looks like, you might not need the locks at all, if you can get each thread to build independent subtrees. If this isn't an online algorithm, presort the words by prefix: first letter if you have fewer than 26 threads, first and second letters if you have more or you know the data isn't balanced (for example, 90% of the words start with A). Basically, this is an O(n) operation: one pass to count how many words start with each letter, then one pass to sort (along the lines of a radix sort keyed on your chosen prefix). Then divide the prefixes among the threads and have each thread build its independent subtrees. Finally, have one thread add each of these subtrees to the root. I'll walk through an example below.
Your dictionary:
Bark
Apple
Cookie
And
Baby
Corn
Blue
Cake
Bacon
After sorting:
Apple
And
Bark
Baby
Blue
Bacon
Corn
Cookie
Cake
Then we divide prefixes among the threads. For this example we have 3 threads, which get the prefixes [A][B][C], and build the following trees:
A --|       B -------|      C --------|
P   N   |-- A ---|   L     O ---|    A
P   D   R   B    C   U     O    R    K
L       K   Y    O   E     K    N    E
E                N         I
                           E
And then you have one thread combine these at the root like:
|--------------- Root ---------------|
A --|       B -------|      C --------|
P   N   |-- A ---|   L     O ---|    A
P   D   R   B    C   U     O    R    K
L       K   Y    O   E     K    N    E
E                N         I
                           E
I hope that made sense.
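The partition-build-merge steps above can be sketched as follows (hypothetical helper names; plain dicts as trie nodes and a thread pool standing in for the worker threads — under CPython's GIL this shows the structure rather than a real speedup):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_subtrie(words):
    # Ordinary single-threaded trie build for words sharing a first letter,
    # keyed from the second character onward.
    root = {}
    for w in words:
        node = root
        for ch in w[1:]:
            node = node.setdefault(ch, {})
        node["$"] = True                      # end-of-word marker
    return root

def parallel_trie(words, max_workers=4):
    buckets = defaultdict(list)               # first pass: group by first letter
    for w in words:
        buckets[w[0]].append(w)
    root = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {c: pool.submit(build_subtrie, ws) for c, ws in buckets.items()}
        for c, fut in futures.items():        # one thread attaches the subtrees
            root[c] = fut.result()
    return root

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

Each bucket is touched by exactly one worker, so no locking is needed until the final single-threaded attach.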
Advantages to this method: The threads work essentially independently, and you don't have the overhead of having to deal with acquiring and releasing locks.
Drawbacks to this method: If you don't know anything about the dictionary, there may be a severe workload imbalance, and in the worst case (say, all the words start with 'A') it reverts to being a single thread building a tree. There are a few ways to mitigate this - for example, you can add a check during the sort so that if the workload for a single-letter prefix is severely imbalanced, you re-sort by the first two letters - but you really can't guarantee balance.
You also may have idle threads at times: say you have 20 threads and sort by first letter; then 6 threads have to handle two subtrees each, while the other 14 sit idle for half the time. You may be able to further subdivide the subtrees to deal with this, but that's extra time spent on a preprocessing step.
Anyway, no guarantees that this is faster than your method, but it's something to consider.
Upvotes: 1
Reputation: 363487
An obvious lock-free algorithm would be:
1. Bucket the input words by their first k characters.
2. Build the subtrie for each bucket in its own thread, with nothing shared between threads.
3. Have a single thread attach the finished subtries below the first k levels of the trie.
Assuming a uniform distribution of prefixes, this can give you a linear speedup up to the size of the alphabet to the power k.
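For concreteness, mapping a word to one of alphabet^k buckets might look like this (hypothetical helper; padding words shorter than k with the first alphabet symbol is one arbitrary choice among several):

```python
def bucket_index(word, k, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # Map a word to one of len(alphabet)**k buckets by its first k characters,
    # treating the prefix as a base-|alphabet| number.
    idx = 0
    for ch in word[:k].ljust(k, alphabet[0]):
        idx = idx * len(alphabet) + alphabet.index(ch)
    return idx
```

With k = 2 and a 26-letter alphabet this yields 26**2 = 676 independent buckets, matching the "alphabet to the power k" bound on the speedup.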
Upvotes: 10
Reputation: 372664
It just occurred to me that this can be made lock-free by using atomic test-and-set operations on the pointers instead of locks. Specifically, if a thread wants to follow a pointer, it does the following:
1. If the pointer is already non-null, just follow it.
2. Otherwise, allocate a new node, then atomically compare-and-swap the null pointer to point at it.
3. If the compare-and-swap fails (another thread installed a node first), discard the new node and follow the pointer that is now there.
Depending on the hardware, this could be much faster, since it avoids the effort of locking and unlocking all the time and ensures that no thread is ever waiting indefinitely.
One downside is that the number of allocations involved goes up, since multiple threads might all try allocating a node to put in the trie at a certain spot, but only one can put it there. Fortunately, this can be mitigated by the following optimization: if a thread ever allocates a node unnecessarily, rather than immediately freeing it, it just stores the node in temporary space. If later on it needs to allocate a new node, it can use the cached one. If not, it can free it at the very end.
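Python has no hardware compare-and-swap, but under CPython's GIL dict.setdefault is atomic, so it can stand in for the test-and-set described above; here is a sketch (hypothetical names) that also includes the spare-node cache from the optimization:

```python
import threading

class Node:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

class CASTrie:
    def __init__(self):
        self.root = Node()
        self._spare = threading.local()   # per-thread cached spare node

    def insert(self, word):
        node = self.root
        for ch in word:
            spare = getattr(self._spare, "node", None)
            if spare is None:
                spare = Node()                          # tentative allocation
            # setdefault plays the role of the compare-and-swap: it installs
            # `spare` only if no child exists yet, and returns the winner.
            winner = node.children.setdefault(ch, spare)
            if winner is spare:
                self._spare.node = None                 # our node was installed
            else:
                self._spare.node = spare                # lost the race: cache it
            node = winner
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word
```

In C++ the same pattern would use an atomic pointer and compare_exchange on each child slot; the thread-local spare node keeps a lost race from costing an extra allocation on the next insertion.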
Hope this helps!
Upvotes: 4
Reputation: 178411
Well, there is the obvious trade-off between fine vs. coarse granularity: set a lock for a set of nodes rather than for each one.
A simple way to do it is via hashing - have m different locks, and for each node you want to access, acquire the lock numbered hash(node) % m.
Note that this approach is basically a generalization of the suggested approach (with perfect hashing and n == m), and of the serial approach (with m == 1).
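A sketch of this lock-striping idea in Python (hypothetical names; id(node) standing in for hash(node), and m = 8 stripes chosen arbitrarily):

```python
import threading

class Node:
    def __init__(self):
        self.children = {}
        self.is_word = False

class StripedTrie:
    def __init__(self, num_stripes=8):
        # m lock stripes shared across all nodes, instead of one lock per node.
        self.stripes = [threading.Lock() for _ in range(num_stripes)]
        self.root = Node()

    def _lock_for(self, node):
        # hash(node) % m; id() serves as the node hash here.
        return self.stripes[id(node) % len(self.stripes)]

    def insert(self, word):
        node = self.root
        for ch in word:
            with self._lock_for(node):
                child = node.children.get(ch)
                if child is None:
                    child = Node()
                    node.children[ch] = child
            node = child
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word
```

Setting num_stripes equal to the node count recovers the lock-per-node scheme, and num_stripes = 1 recovers the serial one, mirroring the generalization noted above.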
Another thing that might be utilized is optimistic design. Whether it actually increases performance depends on the distribution of the dictionary and the size of the trie, of course, but it might help a lot if collisions tend to be very rare (which might be the case for a dictionary of very long words).
The idea is to just add the words to the trie without any synchronization, and if you encounter a collision, roll back to the last known stable state (this of course requires taking a snapshot of the data, and might be infeasible if we are speaking about streams of data that cannot be stored).
Upvotes: 1