Laavaa

Reputation: 575

Returning 10 most frequently used words in a document in O(n)

How can I design an algorithm that returns the 10 most frequently used words in a document in O(n) time, assuming additional space can be used?

I can parse the document and place the words in a hash map with their counts. But next I have to sort the values to get the most frequent ones. I would also need a mapping from the values back to the keys, which cannot be maintained since values may repeat.

So how can I solve this?

Upvotes: 4

Views: 4142

Answers (5)

valdo

Reputation: 12923

It may be done in O(n) if you use the correct data structure.

Consider a Node, consisting of 2 things:

  • A counter (initially set to 0).
  • An array of 255 (or whatever number of characters) pointers to Node. All the pointers are initially set to NULL.

Create a root node. Define a "current" Node pointer and set it to the root node initially. Then walk through all the characters of the document and do the following:

  • If the next character is not a space, pick the appropriate pointer from the current node's array; if it is NULL, allocate it. Then update the "current" Node pointer to that child.
  • If it is a space (or whatever word delimiter), increment the counter of the "current" Node, then reset the "current" Node pointer to point back to the root node.

By doing this you build the tree in O(n). Every element (nodes and leaves alike) denotes a specific word, together with its counter.

Then traverse the tree to find the node with the largest counter. This is also O(n), since the number of elements in the tree is no larger than O(n).

Update:

The last step is not mandatory. Actually, the most common word can be updated during the character processing. The following is pseudo-code:

#include <cstring>   // for memset

struct Node
{
    size_t m_Counter;
    Node* m_ppNext[255];   // one child slot per possible character value
    Node* m_pPrev;

    Node(Node* pPrev) : m_Counter(0)
    {
        m_pPrev = pPrev;
        memset(m_ppNext, 0, sizeof(m_ppNext));
    }
    ~Node()
    {
        for (size_t i = 0; i < sizeof(m_ppNext) / sizeof(m_ppNext[0]); i++)
            if (m_ppNext[i])
                delete m_ppNext[i];
    }
};

Node root(NULL);
Node* pPos = &root;
Node* pBest = &root;
char c;

while (0 != (c = GetNextDocumentCharacter()))
{
    if (c == ' ')
    {
        if (pPos != &root)
        {
            pPos->m_Counter++;

            if (pBest->m_Counter < pPos->m_Counter)
                pBest = pPos;

            pPos = &root;
        }
    } else
    {
        // Index by the raw byte value; the cast avoids a negative index for characters above 127
        Node*& pNext = pPos->m_ppNext[(unsigned char)c - 1];
        if (!pNext)
            pNext = new Node(pPos);
        pPos = pNext;
    }
}

// pBest points to the most common word. Using pBest->m_pPrev we iterate in reverse order through its characters

Upvotes: 2

Tyler Durden

Reputation: 11522

The fastest approach is to use a radix tree. You can store each word's count at its leaf in the radix tree. Keep a separate list of the 10 most frequent words and their occurrence counts, along with a variable that stores the threshold count necessary to get into this list. Update this list as items are added to the tree.
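Here is a rough Java sketch of that idea. For simplicity it uses an uncompressed trie rather than a true radix (compressed) tree, and the class and method names are my own; it is meant to illustrate the "update the top-10 list as items are added" part, not to be a definitive implementation.

import java.util.*;

// Sketch only: an uncompressed trie standing in for the radix tree,
// plus a bounded min-heap that tracks the 10 most frequent words so far.
class TrieTopWords {
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        int count = 0;       // occurrences of the word ending at this node
        String word = null;  // set only for nodes that terminate a word
    }

    private static final int K = 10;
    private final TrieNode root = new TrieNode();
    // The weakest of the current top-K entries sits at the head of the heap,
    // so topK.peek().count is the threshold needed to get into the list.
    private final PriorityQueue<TrieNode> topK =
            new PriorityQueue<>(Comparator.comparingInt((TrieNode n) -> n.count));

    // Insert one word and update the top-K list as we go.
    void add(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.word = word;
        node.count++;

        if (topK.remove(node)) {
            topK.add(node);                       // re-heapify the updated node
        } else if (topK.size() < K || node.count > topK.peek().count) {
            if (topK.size() == K) topK.poll();    // evict the weakest entry
            topK.add(node);
        }
    }

    // Returns the current top words (in no particular order).
    List<String> topWords() {
        List<String> result = new ArrayList<>();
        for (TrieNode n : topK) result.add(n.word + " (" + n.count + ")");
        return result;
    }
}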

Upvotes: 2

smk

Reputation: 5842

Maintaining a map of (word,count) will be O(n).

Once the map is built, iterate over the keys and retrieve the ten most common keys.

O(n) + O(n)

-- But I'm not entirely happy with this solution because of the extra memory needed.
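A rough Java sketch of this approach (variable and class names are mine), using a small bounded min-heap for the second pass so that finding the ten most common keys stays linear in the number of distinct words:

import java.util.*;

class TopTenFromMap {
    static List<String> topTen(List<String> words) {
        // Pass 1: build the (word, count) map. O(n)
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }

        // Pass 2: keep the ten most common entries in a bounded min-heap. O(m)
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 10) heap.poll();   // drop the least frequent
        }

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);             // most frequent first
        return result;
    }
}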

Upvotes: 0

sampson-chen

Reputation: 47269

Here's a simple algorithm:

  • Read one word at a time through the document. O(n)
  • Build a HashTable using each word. O(n)
    • Use the word as the key. O(1)
    • Use the number of times you've seen this word as the value. O(1)
    • (e.g. If you are adding the key to the hashtable, then value is 1; if you already have the key in the hashtable, increase its associated value by 1) O(1)
  • Create a pair of arrays of size 10 (i.e. String words[10] / int count[10], or use a < Pair >), use this pair to keep track of the 10 most frequent words and their word counts in the next step. O(1)
  • Iterate through the completed HashTable once: O(n)
    • If the current word has a higher word count than an entry in the array pair, replace that particular entry and shift everything down 1 slot. O(1)
  • Output the pair of arrays. O(1)

O(n) Run-time.

O(n) Storage for the HashTable + Arrays

(Side note: You can just think of a HashTable as a dictionary: a way to store key:value pairs where keys are unique. Technically, in Java a HashMap is unsynchronized while a Hashtable is synchronized.)
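A hedged Java sketch of the array-pair step described above (names are illustrative): once the HashTable of counts is built, one pass over its entries keeps the 10 most frequent words in two parallel arrays, shifting lower entries down one slot on each insertion.

import java.util.Map;

class TopTenArrayPair {
    // words and count are the pair of size-10 arrays described above;
    // they are assumed to start out filled with null / 0.
    static void fillTopTen(Map<String, Integer> counts,
                           String[] words, int[] count) {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {    // O(n)
            int c = e.getValue();
            for (int i = 0; i < count.length; i++) {                // fixed 10 slots: O(1)
                if (c > count[i]) {
                    // Shift everything below slot i down by one, then insert.
                    for (int j = count.length - 1; j > i; j--) {
                        count[j] = count[j - 1];
                        words[j] = words[j - 1];
                    }
                    count[i] = c;
                    words[i] = e.getKey();
                    break;
                }
            }
        }
    }
}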

Upvotes: 5

byteherder

Reputation: 331

I would use an ArrayList and a HashTable.

Here is the algorithm I am thinking of:

Loop through all the words in the document:

if (hashTable.containsKey(word)) {
    hashTable.put(word, hashTable.get(word) + 1);   // increment count for that word
} else {
    arrayList.add(word);
    hashTable.put(word, 1);                          // first occurrence: count = 1
}

After looping through the whole document,

Loop through the words in the ArrayList:
    Retrieve the word count for that word from the HashTable;
    Keep a running list of the top 10 words;

The running time should be O(n) to construct the HashTable and ArrayList. Making the top-10 list should be O(m), where m is the number of unique words. That gives O(n + m), and since n >> m, this is O(n).

Upvotes: 0
