Reputation: 4941
I know what algorithm I'd like to use, but I want to know what I'd have to change since the file is so large.
I want to use a hash map to store the frequencies of the words and a min-heap to store the k most frequent words, adjusting the min-heap as I loop through the words. This should take O(n log k), I think. How would my algorithm need to change if I have too much data to store in memory? This is an issue I have trouble understanding in general, not just for this specific question; I'm giving context in case it helps with the explanation.
Upvotes: 3
Views: 1048
Reputation: 134095
Added after your comment that you need to calculate the frequencies.
You don't say how many words you expect are in the data, or what constitutes a word. If it's English text, I'd be surprised to see half a million distinct words, and there certainly won't be a billion distinct words in 5 gigabytes of text. But the technique doesn't really change, regardless of how many words there are.
You start by building a dictionary or hash map that contains key value pairs: word, count. As you read each word, look it up in the dictionary. If it's there, increase its count. If it's not there, add it with a count of 1.
If you have a lot of memory or relatively few words, it'll all fit into memory. If so, you can do the heap thing that I describe below.
If your memory fills up, then you simply write the key/value pairs out to a text file, one pair per line, like this:
word1, count
word2, count
Then clear your dictionary and keep going, adding words or increasing their counts. Repeat as necessary for each block of words until you've reached the end of the input.
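A minimal sketch of that count-and-spill loop in Python (the cap on dictionary size, the output file name, and the words iterable are placeholders of my choosing):
MAX_WORDS = 1_000_000  # arbitrary cap on how many distinct words to hold in memory

def count_and_spill(words, pairs_path="pairs.txt"):
    # words is any iterable of words; pairs_path is just an illustrative name.
    counts = {}
    with open(pairs_path, "w", encoding="utf-8") as out:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
            if len(counts) >= MAX_WORDS:   # "memory full": dump this block and start over
                spill(counts, out)
                counts = {}
        spill(counts, out)                 # don't forget the final partial block
    return pairs_path

def spill(counts, out):
    # One "word, count" pair per line, in the format shown above.
    for word, count in counts.items():
        out.write(f"{word}, {count}\n")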
Now you have a huge text file that contains word/count pairs. Sort it by word. There are many external sorting tools that'll do that. Two that come to mind are the Windows SORT utility and the GNU sort. Both can easily sort a very large file of short lines.
Once the file is sorted by word, you'll have:
word1, count
word1, count
word2, count
word3, count
word3, count
word3, count
Now it's a simple matter of going sequentially through the file, accumulating counts for words. At each word break, check its count against the heap as described below.
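Here's a sketch of that sequential pass in Python, assuming sorted_path names the externally sorted file of "word, count" lines (the function and variable names are mine):
import heapq

def top_k_from_sorted(sorted_path, k):
    heap = []                      # min-heap of (total, word); holds at most k entries
    current_word, total = None, 0
    with open(sorted_path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rsplit(", ", 1)
            if word != current_word:
                if current_word is not None:
                    offer(heap, k, total, current_word)   # word break: check against the heap
                current_word, total = word, 0
            total += int(count)
    if current_word is not None:
        offer(heap, k, total, current_word)               # flush the last word
    return sorted(heap, reverse=True)

def offer(heap, k, total, word):
    # Keep only the k largest totals seen so far (same heap logic as described below).
    if len(heap) < k:
        heapq.heappush(heap, (total, word))
    elif total > heap[0][0]:
        heapq.heapreplace(heap, (total, word))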
This whole process takes some time, but it works quite well. You can speed it some by sorting the blocks of words and writing them to individual files. Then, when you've reached the end of input you do an N-way merge on the several blocks. That's faster, but forces you to write a merge program, unless you can find one. If I were doing this once, I'd go for the simple solution. Were I to be doing it often, I'd spend the time to write a custom merge program.
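For what it's worth, if you do write each block to its own file, already sorted by word, Python's heapq.merge can stand in for the custom N-way merge program; a rough sketch (run_paths and out_path are illustrative names):
import heapq

def merge_runs(run_paths, out_path="merged.txt"):
    # N-way merge of run files that are each sorted by word into one sorted file.
    files = [open(p, encoding="utf-8") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            for line in heapq.merge(*files, key=lambda ln: ln.rsplit(", ", 1)[0]):
                out.write(line)
    finally:
        for f in files:
            f.close()
    return out_path
The merged output can then be fed to the same sequential pass sketched above.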
After you've computed the frequencies ...
Assuming your file contains the words and their frequencies, and all you want to do is get the k words with the highest frequencies, then yes, it's O(n log k), and you don't have to store all of the items in memory. Your heap only requires k items.
The idea:
heap = new minheap();
for each item
    // if you don't already have k items on the heap, add this one
    if (heap.count < k)
        heap.Add(item)
    else if (item.frequency > heap.Peek().frequency)
    {
        // The new item's frequency is greater than the lowest frequency
        // already on the heap. Remove the item from the heap
        // and add the new item.
        heap.RemoveRoot();
        heap.Add(item);
    }
After you've processed every item, the heap will contain the k items that have the highest frequencies.
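For what it's worth, a runnable Python version of that idea using the standard heapq module might look like this (items can be any iterable of (word, frequency) pairs; the names are mine):
import heapq

def top_k(items, k):
    # Min-heap of (frequency, word) pairs, never holding more than k entries,
    # so the smallest of the current top k is always at the root.
    heap = []
    for word, frequency in items:
        if len(heap) < k:
            heapq.heappush(heap, (frequency, word))
        elif frequency > heap[0][0]:
            # The new frequency beats the lowest one on the heap:
            # pop the root and push the new pair in one step.
            heapq.heapreplace(heap, (frequency, word))
    return sorted(heap, reverse=True)
heapq.nlargest(k, items, key=lambda pair: pair[1]) does essentially the same thing in a single call.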
Upvotes: 4
Reputation: 10595
I think there is no deterministic way to do that without having the entire file in memory (or doing some expensive kind of merge sort).
But there are some good probabilistic algorithms. Take a look at the Count-Min Sketch.
There is a great implementation of this and other algorithms in this library.
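For illustration, a minimal Count-Min Sketch fits in a few lines of Python; the width, depth, and salted-MD5 hashing below are arbitrary choices of mine, not taken from any particular library:
import hashlib

class CountMinSketch:
    def __init__(self, width=2**16, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, word):
        # Derive `depth` different hash values by salting the word with the row number.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{word}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, word):
        for row, col in self._indexes(word):
            self.table[row][col] += 1

    def estimate(self, word):
        # Collisions can only inflate counts, so the estimate never undercounts.
        return min(self.table[row][col] for row, col in self._indexes(word))
As you add() each word you can compare its current estimate() against a k-sized min-heap, so only the fixed-size sketch and the heap have to fit in memory; the trade-off is that the counts are approximate.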
Explaining the merge sort thing: if your file is already sorted, you can find the k most frequent words pretty easily with a min-heap. Yes, a min-heap, so you can discard the least frequent term whenever you find a more competitive one. You can do this because, in a sorted file, all occurrences of a word are adjacent, so you know the frequency of the current word without having to read the entire file. If your file is unsorted, you must keep the entire list of counts, because the occurrences of the most frequent term may be scattered throughout the file, and it could be discarded as "non-competitive" too soon.
You can do a merge sort with limited memory pretty easily, but it's an I/O-intensive operation and may take a while. In fact, you can use any kind of external sort.
Upvotes: 4
Reputation: 7817
You can use a selection algorithm (http://en.wikipedia.org/wiki/Selection_algorithm) to calculate the kth largest number, then do a linear scan and keep only the k numbers that are at least that large.
In practice you might want to start with an estimated range that the kth largest falls into and continue from there. E.g., read the first M numbers and take the estimated kth largest to be the (k*M/N)th largest among those M numbers. If you think the data is biased (i.e. partially sorted), then pick those M numbers randomly.
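A rough quickselect-style sketch of that selection step in Python, assuming the frequencies fit in memory as a list (k is 1-based, so k=1 returns the maximum):
import random

def kth_largest(values, k):
    # Selection by partitioning around a random pivot: average O(n), no full sort.
    pivot = random.choice(values)
    above = [v for v in values if v > pivot]
    equal = [v for v in values if v == pivot]
    below = [v for v in values if v < pivot]
    if k <= len(above):
        return kth_largest(above, k)
    if k <= len(above) + len(equal):
        return pivot
    return kth_largest(below, k - len(above) - len(equal))
With that threshold frequency in hand, one more linear scan keeps the words whose frequency is at least the threshold (watching for ties so you return exactly k).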
Upvotes: 0