Dvorak
Dvorak

Reputation: 169

Efficient algorithm to find frequency of words, max word frequency and total word count

I am trying to find frequencies of each word occurring in a section of a file and total word count of that section. For example if there's a file: file.txt:

This is a file section which is part of the file.
# This is another file section which is part of the same file separated by hash.

I wish to find frequency of each word, which word has maximum frequency and total word count in each section in an efficient manner. Such that:

In Section 1: This-1; is-2; a-1; file-2; section-1; which-1; part-1; of-1; the-1| Total Words: 11| Word having Maximum Frequency: is,file
In Section 2: This-1; is-2; another-1; file-2; section-1; which-1; part-1; of-1; the-1; same-1; by-1; hash-1;| Total Words:15| Words having Maximum Frequency: is,file

So far, I have come up with a loop which goes through each word, increases the Total Word Count, then put each word in a Key/Value Pair having frequency of each word. I don't know about maximum frequency. Is there any efficient algorithm which I could try to use?

I wish to do so in Java. So, I was thinking of using HashMaps but any better approach is welcome.

Thanks :)

Upvotes: 1

Views: 1990

Answers (1)

Jim Mischel
Jim Mischel

Reputation: 133995

You can easily keep track of the current maximum as you update each word. For example, your loop for each section:

Initialize HashMap of Words
maxWord = null  // word with current max count
while not end of section
    get word
    if word in Words
        increment count of word in HashMap
    else
        add to Words with count of 1
    if maxWord == null || Words[word].Count > Words[maxWord].Count
        maxWord = word
end while

When you complete processing the section, you have the frequencies of all the words, and maxWord contains the word with the largest count.

The entire algorithm is O(n). You can do it in a single pass of the file.

Much simpler, though, to just build your HashMap of words and, at the end of each section, go sequentially through it to pick out the one that has the max count. That, too, is considered O(n).

Upvotes: 1

Related Questions