John

Reputation: 911

Finding k most common words in a file, using concurrency

This is an interview question I found.
The first part is to find the k most common words in a file. This is a well-known question, and a solution can be found here. The run time of this solution is O(n log k + k), where n is the file size.
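For context, a minimal single-threaded sketch of that baseline, assuming the words have already been tokenized into a list (a hash map for counting plus a min-heap of size k; the names here are illustrative):

    import java.util.*;

    class TopKWords {
        // Count every word, then keep a min-heap of size k ordered by count,
        // so the least frequent of the current candidates is always on top.
        static List<String> topK(List<String> words, int k) {
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);               // O(n) counting
            }
            PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                heap.offer(e);                                  // O(log k) per word
                if (heap.size() > k) heap.poll();               // evict the least frequent
            }
            List<String> result = new ArrayList<>();
            while (!heap.isEmpty()) result.add(heap.poll().getKey());
            Collections.reverse(result);                        // most frequent first
            return result;
        }
    }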

The second part is finding the k most common words in a file, but this time using concurrency. That is, if we can run m threads in parallel on our system, how can we improve the solution above?

This is where I'm stuck. First I thought of splitting the file into m parts and sending each part to a different thread so it can process it. But the problem is that these threads would have to share the same trie, which means they would have to be synchronized. So even if I gain a better run time on paper using those m threads, in reality I'm pretty sure the synchronization will make the run time worse than the run time of the implementation above.
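To make the concern concrete, here is a rough sketch of that shared-structure idea, with a ConcurrentHashMap standing in for the trie and the file assumed to be pre-split into per-thread word chunks (both are simplifications, not part of the original setup):

    import java.util.*;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.LongAdder;

    class SharedCountDemo {
        // All threads update the same map, so every increment touches shared state.
        static Map<String, LongAdder> countShared(List<List<String>> chunks) throws Exception {
            ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
            ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
            for (List<String> chunk : chunks) {
                pool.submit(() -> {
                    for (String w : chunk) {
                        counts.computeIfAbsent(w, x -> new LongAdder()).increment();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            return counts;
        }
    }

Even with a concurrent map, all threads are writing into the same structure, which is exactly the synchronization cost the question is worried about.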

Do you have an idea how to implement this using concurrency in order to improve the run-time?

Upvotes: 1

Views: 166

Answers (3)

Ben Manes

Reputation: 9621

You could use a CountMinSketch to maintain the counts approximately in a compact O(1) form. This gives you the heavy hitters without retaining the words themselves. Multiple sketches can be merged by simple element-wise array summation, assuming identical hash functions are used. That allows threads to process independent sections of the file in parallel. It is similar to the hash table approach, except that you have forgotten the word mapping.
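A minimal sketch of such a structure (class name and sizing are illustrative, not from any particular library):

    class CountMinSketch {
        // depth x width grid of counters; all threads must use the same seeds
        // so that their sketches can later be merged cell by cell.
        private final int depth, width;
        private final long[][] table;
        private final int[] seeds;

        CountMinSketch(int depth, int width, int[] seeds) {
            this.depth = depth;
            this.width = width;
            this.table = new long[depth][width];
            this.seeds = seeds;
        }

        private int bucket(String word, int row) {
            return Math.floorMod(word.hashCode() * 31 + seeds[row], width);
        }

        void add(String word) {
            for (int r = 0; r < depth; r++) table[r][bucket(word, r)]++;
        }

        long estimate(String word) {
            long min = Long.MAX_VALUE;               // the estimate never undercounts
            for (int r = 0; r < depth; r++) min = Math.min(min, table[r][bucket(word, r)]);
            return min;
        }

        // Merging is a plain element-wise sum, valid only for identical seeds and sizes.
        void merge(CountMinSketch other) {
            for (int r = 0; r < depth; r++)
                for (int c = 0; c < width; c++) table[r][c] += other.table[r][c];
        }
    }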

The next step would be to scan the dictionary of words to determine the top k. If you maintain a dictionary per thread, those can be merged rather than rescanning the file. For each item in that list you query its frequency in O(1), which lets you rank them to obtain the most popular. Since we care about the frequency order, not the words themselves, this reduces O(lg k) to O(lg f). In that case words with the same popularity are chained on a list. We can compare a new word against the least frequent word seen so far in O(1) if a SkipList or MinHeap is used.
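Continuing the hypothetical CountMinSketch above, selecting the top k from a merged per-thread dictionary with a bounded min-heap could look roughly like this:

    import java.util.*;

    class SketchTopK {
        // Rank the merged word set by its sketch estimates, keeping only k candidates.
        static List<String> topK(Set<String> words, CountMinSketch sketch, int k) {
            PriorityQueue<String> heap =
                new PriorityQueue<>(Comparator.comparingLong((String w) -> sketch.estimate(w)));
            for (String w : words) {
                heap.offer(w);                       // compare against the least frequent candidate
                if (heap.size() > k) heap.poll();    // drop the current minimum
            }
            List<String> result = new ArrayList<>(heap);
            result.sort(Comparator.comparingLong((String w) -> sketch.estimate(w)).reversed());
            return result;
        }
    }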

The runtime is then the sum of:

  • O(n/T), where n is the file size and T is the number of threads
  • O(T * m), where m is the number of unique words (cardinality) and T threads might produce all m words to hash join
  • O(m * lg f), where f is the top frequencies, to find the most popular words

That results in O(n/T) + O(T * m) + O(m * lg f), where n is by far the dominant term, so it should reduce to O(n) in practice.

Upvotes: 0

Matt Timmermans

Reputation: 59323

This is a map-reduce operation. Let's say we have a file of length N and M threads:

  1. Each thread is assigned a portion of the file, and counts the frequency of all the words in its portion. This assumes that the threads can access that data in parallel with bounded contention costs: O(N/M)
  2. Each thread hashes the words into M buckets consistently, and sends the counts for each bucket to the corresponding thread, so that each thread receives counts for a roughly equal portion of the words: O(N/M)
  3. Each thread combines the counts it received to generate total counts of all of its words. That is combining M results of size O(N/M^2): O(N/M)
  4. Each thread calculates its top K counts and sends them to the result processing thread: O(log K * N/M)
  5. The result processing thread combines the M sorted lists to extract the top K counts. It's important to ensure that this thread can stop reading an input list when it doesn't need the rest of the data. Then: O( (K + M) log M )

So with M threads you can do the whole job in O( (N/M) * log K + (K+M) * log M).

If you use a streaming parallel merge for step 5, then you can reduce that to O( (N/M) * log K + K + log M), but that's only better if you have a lot of threads.
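A rough end-to-end sketch of this pipeline, assuming the file has already been split into per-thread word chunks (all names here are illustrative):

    import java.util.*;
    import java.util.concurrent.*;
    import java.util.stream.*;

    class MapReduceTopK {
        static List<Map.Entry<String, Long>> topK(List<List<String>> chunks, int k)
                throws Exception {
            int m = chunks.size();
            ExecutorService pool = Executors.newFixedThreadPool(m);

            // Steps 1-2: each thread counts its chunk locally, partitioning the
            // counts by owner = hash(word) mod m.
            List<Future<List<Map<String, Long>>>> partialFutures = new ArrayList<>();
            for (List<String> chunk : chunks) {
                partialFutures.add(pool.submit(() -> {
                    List<Map<String, Long>> byOwner = new ArrayList<>();
                    for (int i = 0; i < m; i++) byOwner.add(new HashMap<>());
                    for (String w : chunk) {
                        byOwner.get(Math.floorMod(w.hashCode(), m)).merge(w, 1L, Long::sum);
                    }
                    return byOwner;
                }));
            }
            List<List<Map<String, Long>>> partials = new ArrayList<>();
            for (Future<List<Map<String, Long>>> f : partialFutures) partials.add(f.get());

            // Steps 3-4: each owner sums the partial counts for its bucket and
            // extracts a local top k.
            List<Future<List<Map.Entry<String, Long>>>> localTops = new ArrayList<>();
            for (int owner = 0; owner < m; owner++) {
                final int o = owner;
                localTops.add(pool.submit(() -> {
                    Map<String, Long> total = new HashMap<>();
                    for (List<Map<String, Long>> p : partials)
                        p.get(o).forEach((w, c) -> total.merge(w, c, Long::sum));
                    return total.entrySet().stream()
                            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                            .limit(k)
                            .collect(Collectors.toList());
                }));
            }

            // Step 5: one result thread merges the M local top-k lists; since every
            // word is owned by exactly one bucket, the global top k lies in this union.
            PriorityQueue<Map.Entry<String, Long>> merged =
                new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());
            for (Future<List<Map.Entry<String, Long>>> f : localTops) {
                for (Map.Entry<String, Long> e : f.get()) {
                    merged.offer(e);
                    if (merged.size() > k) merged.poll();
                }
            }
            pool.shutdown();
            List<Map.Entry<String, Long>> result = new ArrayList<>(merged);
            result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            return result;
        }
    }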

Upvotes: 1

user1196549

Reputation:

What about letting every thread find the k most common words in its share of the file, independently, and then merging the results?

Note that I am pretty skeptical about any speedup possible by performing concurrent reads of the file, which is probably the slowest operation in this problem.

Upvotes: 0
