Reputation: 1634

Find the number occurences of each word in a document?

I was asked the question in an interview. The interviewer told me to assume that there exists a function say getNextWord() to return the next word in a given document. My task was to design a data structure to implement the task, and give an algorithm that constructs a list of all words with their frequencies.

Being from a C++ background, my answer was to create a multimap of string and then insert all words in it, and later display the count of it. I was however told later, to do this in a more generic way. By generic he meant that he didn't want me to use a library feature. Also I guess a multimap is implemented internally as a 2-3 tree or so, so for the multimap solution to be generic I would need to code the 2-3 tree as well.

Although tries did come to mind, implementing one during an interview was out of question for me. So, I just wanted to know if there are better ways of achieving it? Or is there a way to implement it in a smooth manner using tries?

Upvotes: 1

Answers (3)

OSH

Reputation: 2937

I think the simplest solution would be a a Trie. O(N) is given in this case (both for insertion and getting the count). Just store the count in an additional space at every node.

Basically each node in the tree contains 26 links to 26 possible children (1 for each letter) + 1 counter (for words the are terminated in the current node) . Just look at the link for a graphic image of a trie.

Upvotes: 0

amit

Reputation: 178521

Any histogram based algorithm would be both effient and generic in here. The idea is simple: build a histogram from the data. A generic interface for a histogram is a Map<String,Integer>

Iterate the document once (with your nextDoc() method), while maintaining your histogram.

The best implementation for this interface, in terms of big O notations - would probably be to use a trie, and in each leaf node - add the counter of occurances.

Getting the actual (word,number) pairs from the trie will be done by a simple DFS on the trie.

This solution gives you O(n * |S|) time complexity, where |S| is the average size for a string.

The insertion algorithm for each word:
Each time you add a new word: check if it already exists, if it does - increase the counter, else - add the word to the dictionary with a counter value of 1.

Upvotes: 3

Sergei Danielian

Reputation: 5025

I'd try to implement a B-Tree (or smth quite similar) to store all the words. Therefore I could easily find a next word, if already have it and increase associated counter in the node. Or just insert a new one.

The time complexity in that case would be: O(nlogn), where n is all words count and logn is a Big-Oh for such kind of tree.

Upvotes: 2

Find the number occurences of each word in a document?

Answers (3)

Related Questions