Reputation: 911
This is an interview question I found.
The first part is to find the k most common words in a file. This is a well-known question, and the solution can be found here. The run-time of this solution is O(n log k + k), where n is the file size.
The second part is finding the k most common words in a file, but this time using concurrency. That is, if we can run m threads in parallel in our system - how can we improve the solution above?
This is where I'm stuck. My first thought was to split the file into m parts and send each part to a different thread so it can process it. But the problem is that these threads would have to share the same trie, which means they would have to be synchronized. So even if I gain a better run-time on paper using those m threads, in practice I'm pretty sure the synchronization will make the run-time worse than the run-time of the implementation above.
Do you have an idea how to implement this using concurrency in order to improve the run-time?
Upvotes: 1
Views: 166
Reputation: 9621
You could use a CountMinSketch to approximately maintain the counts in a compact O(1) form. This provides you the heavy hitters without retaining the words. Multiple sketches can be merged together as simple array summations, assuming the identical hash function is used. That allows threads to process independent sections in parallel. It is similar to the hash table approach, except that you have forgotten the word mapping.
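As a rough illustration (the class below is a hand-rolled sketch of mine, not a particular library), each thread can own its own sketch over its slice of the file, and sketches built with the same sizes and seeds merge by summing their counter arrays:

```java
import java.util.Random;

// Minimal Count-Min Sketch, for illustration only.
// Per-thread instances built with the same depth/width/seed can be merged
// by adding their counter arrays cell by cell.
class CountMinSketch {
    private final int depth;          // number of hash rows
    private final int width;          // counters per row
    private final long[][] counts;    // depth x width counter matrix
    private final int[] seeds;        // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random r = new Random(seed);  // same seed => identical hashing => mergeable
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    private int index(String word, int row) {
        int h = word.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    void add(String word) {
        for (int i = 0; i < depth; i++) counts[i][index(word, i)]++;
    }

    long estimate(String word) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) min = Math.min(min, counts[i][index(word, i)]);
        return min;   // may over-estimate, never under-estimates
    }

    // Merge a sketch built with the same depth, width and seed.
    void merge(CountMinSketch other) {
        for (int i = 0; i < depth; i++)
            for (int j = 0; j < width; j++)
                counts[i][j] += other.counts[i][j];
    }
}
```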
The next step would be to scan the dictionary of words to determine the top k. If you maintain a dictionary per thread, those dictionaries can be merged rather than rescanning the file. For each item in that list you query its frequency, O(1), which lets you rank them to obtain the most popular. As we care about the frequency order, not the words themselves, we reduce O(lg k) to O(lg f). In that case words with the same popularity are chained on a list. We can compare a new word against the least frequent word in O(1) if a SkipList or MinHeap is used.
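Continuing the sketch above (names are illustrative), the selection step could scan the merged candidate words, query the sketch for each frequency, and keep a size-k min-heap whose root is the least frequent of the current leaders:

```java
import java.util.*;

// Illustrative top-k step, assuming the CountMinSketch class from the previous
// snippet and a merged candidate set (e.g. the union of per-thread dictionaries).
class TopK {
    static List<String> topK(Collection<String> candidates, CountMinSketch sketch, int k) {
        // Min-heap ordered by estimated frequency; the root is the weakest of the
        // current top-k, so each new word is compared against it first.
        PriorityQueue<String> heap =
                new PriorityQueue<>(Comparator.comparingLong(sketch::estimate));
        for (String word : candidates) {
            if (heap.size() < k) {
                heap.add(word);
            } else if (sketch.estimate(word) > sketch.estimate(heap.peek())) {
                heap.poll();
                heap.add(word);
            }
        }
        List<String> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong(sketch::estimate).reversed());
        return result;  // most popular first
    }
}
```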
The runtime is then the addition of:
- O(n/T), where n is the file size and T is the number of threads
- O(T * m), where m is the number of unique words (cardinality) and T threads might produce all m words to hash join
- O(m * lg f), where f is the top frequencies, to find the most popular words

That results in O(n/T) + O(T * m) + O(m * lg f), where n is by far the dominant parameter, so it should reduce to O(n) in practice.
Upvotes: 0
Reputation: 59323
This is a map-reduce operation. Let's say we have a file of length N and M threads:
So with M threads you can do the whole job in O( (N/M) * log K + (K+M) * log M).
If you use a streaming parallel merge for step 5, then you can reduce that to O( (N/M) * log K + K + log M), but that's only better if you have a lot of threads.
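As a rough sketch of that map-reduce shape (the chunking, the single-threaded reduce, and all names here are simplifications of mine): each thread counts its chunk into a private HashMap, the maps are summed, and a size-K min-heap picks the winners.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative only: map phase counts each chunk in a private map (no shared
// state to lock), reduce phase sums the maps, then a size-k min-heap selects
// the top k. A streaming parallel merge would replace the sequential reduce.
class ParallelTopK {
    static List<String> topK(List<List<String>> chunks, int k, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Map<String, Long>>> futures = new ArrayList<>();

        // Map phase: one private count map per chunk.
        for (List<String> chunk : chunks) {
            futures.add(pool.submit(() -> {
                Map<String, Long> local = new HashMap<>();
                for (String w : chunk) local.merge(w, 1L, Long::sum);
                return local;
            }));
        }

        // Reduce phase: sum the per-thread maps into one exact count map.
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : futures)
            f.get().forEach((w, c) -> total.merge(w, c, Long::sum));
        pool.shutdown();

        // Top-k selection with a min-heap keyed by count.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : total.entrySet()) {
            heap.add(e);
            if (heap.size() > k) heap.poll();
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);   // most common first
        return result;
    }
}
```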
Upvotes: 1
Reputation:
What about letting every thread find the k most common words in its share of the file, independently, then merging the results?
Note that I am pretty skeptical about any speedup possible by performing concurrent reads of the file, which is probably the slowest operation in this problem.
Upvotes: 0