user1002288

Reputation: 5040

Design an algorithm, find the most frequently used word in a book

An interview question:

Find the most frequently used word in a book.

My idea:

Use a hash table: traverse the book and count each word in the hash table.

If the book's size is known and any word is found to account for more than 50% of the words, then skip any new words in the rest of the traversal and only count the words already seen. What if the book's size is unknown?

It is O(n) time and O(n) space.
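
A minimal sketch of the idea in Python (tokenization is simplified to split(); names are illustrative):

from collections import defaultdict

def most_frequent_word(words):
    # Count every word in a hash table (O(n) time), then scan for the
    # maximum (another O(n)); space is O(n) in the worst case.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return max(counts, key=counts.get)

book = "the quick brown fox jumps over the lazy dog the end"
print(most_frequent_word(book.split()))  # -> the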

Any better ideas?

Thanks

Upvotes: 5

Views: 3166

Answers (6)

Don Reba

Reputation: 14051

If you are dealing with a book, then you know the vocabulary and the approximate word frequencies. Even if you are not given this information up front, you can get a good estimate by scanning a random sample.

For the exact answer, I would use a perfect hash function on the k most common words. A perfect hash function requires O(k) memory and guarantees O(1) worst-case lookup.

For the uncommon words, I would use a priority queue implemented as a heap or a self-balancing tree. A regular hash table might also be a good choice.
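
A rough sketch of this split, with an ordinary dict standing in for the perfect hash of the k common words and a Counter as the fallback structure for the rest (building a true perfect hash is beyond the scope of the sketch, and the common-word list is hypothetical):

from collections import Counter

# Hypothetical list of the k most common words, known up front or
# estimated by scanning a random sample of the book.
COMMON_WORDS = ["the", "of", "and", "a", "to", "in", "is", "it"]

def count_words(words):
    # A plain dict stands in for the perfect hash of the k common words
    # (O(k) memory, O(1) lookup); a Counter handles the uncommon words.
    common_counts = dict.fromkeys(COMMON_WORDS, 0)
    rare_counts = Counter()
    for word in words:
        if word in common_counts:
            common_counts[word] += 1
        else:
            rare_counts[word] += 1
    return common_counts, rare_counts

common, rare = count_words("the cat sat on the mat".split())
all_counts = list(common.items()) + list(rare.items())
print(max(all_counts, key=lambda kv: kv[1]))  # -> ('the', 2)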

Upvotes: 0

Abhijit

Reputation: 63777

A heap is usually the data structure that suits problems where we have to determine something like the most or least used element.

Even Python's Counter.most_common, which is used for exactly this purpose, is implemented on top of the heap data structure (via heapq.nlargest).

A binary heap has the following complexities:

CreateHeap - O(1)
FindMin - O(1)
DeleteMin - O(log n)
Insert - O(log n)
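
For illustration, these operations map directly onto Python's heapq module (a minimal sketch):

import heapq

h = []                    # CreateHeap: an empty list is a valid heap, O(1)
for x in [5, 1, 9, 3]:
    heapq.heappush(h, x)  # Insert: O(log n) per element
print(h[0])               # FindMin: peek at the root, O(1)  -> 1
print(heapq.heappop(h))   # DeleteMin: O(log n)              -> 1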

I ran a comparison of a hash (using a defaultdict in Python) against a heap (using collections.Counter.most_common in Python), and the hash is faring slightly better than the heap.

>>> import timeit
>>> stmt1="""
import collections, random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
somehash=collections.defaultdict(int)   # hash-based count
for d in somedata:
    somehash[d]+=1
maxkey=somedata[0]
for k,v in somehash.items():            # linear scan for the maximum
    if v > somehash[maxkey]:
        maxkey=k
"""
>>> stmt2="""
import collections,random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
collections.Counter(somedata).most_common(1)   # heap-based selection
"""
>>> t1=timeit.Timer(stmt=stmt1)
>>> t2=timeit.Timer(stmt=stmt2)
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=10)/10)
38168.96 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=10)/10)
33600.80 usec/pass

Upvotes: 2

thekoalaz

Reputation: 76

This is actually a classic example of MapReduce.

The example on the Wikipedia page gives you the count of each unique word, but you can easily add a step to the reduce phase that keeps track of the current most common word (with some kind of mutex to deal with concurrency issues).

If you have a distributed cluster of machines or a highly parallel computer, this will run much faster than using the hash table.
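
A single-process Python sketch of the map and reduce steps (a real MapReduce framework would distribute the chunks across machines; the chunking and names here are illustrative):

from collections import defaultdict
from itertools import chain

def mapper(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of the book.
    return [(word, 1) for word in chunk.split()]

def reducer(pairs):
    # Reduce: sum the counts per word, then report the most common word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return max(counts.items(), key=lambda kv: kv[1])

chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]
mapped = chain.from_iterable(mapper(c) for c in chunks)  # map phase
print(reducer(mapped))                                   # -> ('the', 3)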

Upvotes: 2

Spike

Reputation: 5130

Your solution is correct, fast, and probably the best/easiest from a practical standpoint.

The other posters' solutions have worse time complexities than your solution. For a hash table, as you are using, the time complexity is indeed O(n). Each insertion is O(1), and there are n words, so the insertion phase costs O(n). Iterating through and finding the max is then O(n). The space is also O(n), as you mentioned.

Note that you will not be able to terminate your algorithm early using Chris's solution because searching your hash table is costly and there is no way for you to perform this in O(1) time after each insertion.

A heap will cost more in time because you need to maintain the heap during each insertion. A heap insertion is O(log(n)), and thus the total cost of the insertions will be O(n log(n)).

Upvotes: 1

SmacL

Reputation: 22932

To determine complexity, I think you need to consider two variables: n = total number of words and m = number of unique words. I imagine the best-case complexity will come out close to O(n log(m)) for speed and O(m) for storage, assuming that for each of the n words you build and search a hash table or other such structure which eventually contains m elements.
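
As a small illustration of where the log(m) factor comes from, here is a sketch that uses a sorted list with binary search in place of a self-balancing tree (note that the list insert itself is O(m); a real tree would make that O(log m) as well):

import bisect

def count_with_sorted_list(words):
    # keys stays sorted, so each of the n words costs an O(log m) binary
    # search; m is the number of unique words seen so far.
    keys, counts = [], []
    for word in words:
        i = bisect.bisect_left(keys, word)
        if i < len(keys) and keys[i] == word:
            counts[i] += 1
        else:
            keys.insert(i, word)   # O(m) for a list; O(log m) for a tree
            counts.insert(i, 1)
    best = max(range(len(keys)), key=counts.__getitem__)
    return keys[best], counts[best]

print(count_with_sorted_list("the cat and the hat".split()))  # -> ('the', 2)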

Upvotes: 2

Chris Shain

Reputation: 51369

There is a generalization of your optimization: if the book size is known, and any word you have seen has a count greater than the remaining number of words plus the next-highest count, then your current highest-counted word is the answer.
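
A hedged sketch of how that check could sit inside the counting loop (names are illustrative; second_best tracks the next-highest count seen so far):

from collections import defaultdict

def most_frequent_word_early_exit(words, total_words):
    # Stop counting once the leader's count exceeds the number of words
    # left to read plus the runner-up's count.
    counts = defaultdict(int)
    best_word, best_count, second_best = None, 0, 0
    for i, word in enumerate(words):
        counts[word] += 1
        if counts[word] > best_count:
            if word != best_word:       # a new leader takes over
                second_best = best_count
                best_word = word
            best_count = counts[word]
        elif word != best_word and counts[word] > second_best:
            second_best = counts[word]
        remaining = total_words - (i + 1)
        if best_count > remaining + second_best:
            break                       # nothing can catch up any more
    return best_word

words = "a a b a c a a".split()
print(most_frequent_word_early_exit(words, len(words)))  # -> a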

Upvotes: 1
