Reputation: 169
I am trying to find the frequency of each word occurring in a section of a file, plus the total word count of that section. For example, suppose there's a file file.txt:
This is a file section which is part of the file.
# This is another file section which is part of the same file separated by hash.
I wish to find frequency of each word, which word has maximum frequency and total word count in each section in an efficient manner. Such that:
In Section 1: This-1; is-2; a-1; file-2; section-1; which-1; part-1; of-1; the-1| Total Words: 11| Word having Maximum Frequency: is,file
In Section 2: This-1; is-2; another-1; file-2; section-1; which-1; part-1; of-1; the-1; same-1; separated-1; by-1; hash-1| Total Words: 15| Words having Maximum Frequency: is,file
So far, I have come up with a loop which goes through each word, increases the total word count, and puts each word into a key/value pair holding its frequency. I don't know how to find the maximum frequency. Is there an efficient algorithm I could use?
I wish to do so in Java. So, I was thinking of using HashMaps but any better approach is welcome.
Thanks :)
Upvotes: 1
Views: 1990
Reputation: 133995
You can easily keep track of the current maximum as you update each word. For example, your loop for each section:
Initialize HashMap of Words
maxWord = null  // word with current max count
while not end of section
    get word
    if word in Words
        increment count of word in Words
    else
        add word to Words with count of 1
    if maxWord == null || Words[word].Count > Words[maxWord].Count
        maxWord = word
end while
When you finish processing a section, you have the frequencies of all the words, and maxWord contains the word with the largest count.
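In Java, that pseudocode might look roughly like this (a sketch, with a hard-coded section string standing in for reading a section of the file; `merge` both inserts a new word with count 1 and increments an existing one):

```java
import java.util.HashMap;
import java.util.Map;

public class SectionStats {
    public static void main(String[] args) {
        String section = "This is a file section which is part of the file";

        Map<String, Integer> counts = new HashMap<>();
        String maxWord = null;   // word with current max count
        int totalWords = 0;

        for (String word : section.split("\\s+")) {
            totalWords++;
            // increment the word's count, or insert it with a count of 1
            int count = counts.merge(word, 1, Integer::sum);
            if (maxWord == null || count > counts.get(maxWord)) {
                maxWord = word;
            }
        }

        System.out.println("Total words: " + totalWords);          // 11
        System.out.println("Max word: " + maxWord                  // is (2)
                + " (" + counts.get(maxWord) + ")");
    }
}
```

Note that this tracks a single maximum word; if two words tie (like is and file in your example), only the one that reached the max count first is kept, so to report all tied words you'd still scan the map at the end of the section.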
The entire algorithm is O(n). You can do it in a single pass of the file.
It is simpler, though, to just build your HashMap
of words and, at the end of each section, iterate over it to pick out the entries with the max count. That, too, is O(n).
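A sketch of that simpler approach, which also handles ties for the maximum (again with a hard-coded section string as a stand-in for the file input):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SectionScan {
    public static void main(String[] args) {
        String section =
            "This is another file section which is part of the same file separated by hash";

        // First pass: count every word in the section
        Map<String, Integer> counts = new HashMap<>();
        for (String word : section.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }

        // Second pass: scan the map, collecting all words tied for the max count
        int max = 0;
        List<String> maxWords = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                maxWords.clear();
                maxWords.add(e.getKey());
            } else if (e.getValue() == max) {
                maxWords.add(e.getKey());
            }
        }

        // For this section: is and file, each with a count of 2
        System.out.println("Words with max frequency (" + max + "): " + maxWords);
    }
}
```

The map scan is at most one entry per distinct word, so the whole thing is still a constant number of passes over the section's words.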
Upvotes: 1