Reputation: 3233
I am analyzing the frequency of groups of words that occur together in a sentence.
Each group consists of 3 consecutive words, and I have to calculate the frequency of each group.
Example: This is a good time to party because this is a vacation time.
Expected Output:
this is a - 2
is a good - 1
a good time - 1
and so on.
I have written a script which works well: it prints each group's frequency, sorted in descending order.
It works by reading one line at a time from the file. For each line, it converts the text to lowercase, splits it into individual words, and forms an array out of them.
Then we take the 3 leftmost words as a group and increment its count in a hash. Once a group is counted, we shift off the leftmost element of the array and repeat until too few words remain to form a group.
The problem is I want to use this script on a file consisting of more than 10 million lines.
After running some tests I observed that it will not work if the number of lines in the input file is more than 400K.
How can I make this script more memory efficient?
Thanks to fxzuz for his suggestions, but now I want to make this script work with larger files :)
#!/usr/bin/perl
use strict;
use warnings;

print "Enter the file name: ";
my $input = <STDIN>;
chomp $input;
open INPUT, '<', $input
    or die "Couldn't open the file $input with error: $!\n";

my %c;
while (my $line = <INPUT>) {
    chomp $line;
    my @x = map lc, split /\W+/, $line;
    while (@x >= 3) {   # >= 3, so the final group on each line is counted too
        $c{"@x[0..2]"}++;
        shift @x;
    }
}
close INPUT;

foreach my $key (sort { $c{$b} <=> $c{$a} } keys %c) {
    if ($c{$key} > 20) {
        print "$key - $c{$key}\n";
    }
}
This works well and prints the groups of words in descending order of frequency, but only those groups which occur more than 20 times.
Now, how do I make this work on a file consisting of more than 1 million or 10 million lines?
I also checked the memory and CPU usage of perl while running this script, using the top command on Linux: on a file of 400K lines, CPU usage reaches 100% and memory usage is close to 90%.
So it is not feasible to run it on a file of 1 million lines, because the perl process will hang.
How can I make this code more memory efficient?
Upvotes: 3
Views: 1045
Reputation: 116177
Apparently, your code is written correctly and will work, but only as long as your data set is not VERY big. If you have a lot of input data (and it seems you do), the sorting phase may fail due to lack of memory. If you cannot increase your memory, the only solution is to write your data to disk, in either text or database format.
Text format: you can simply write your triplets to a text file as you go, one triplet per line. This increases the output size by a factor of 3, but it should still be manageable. Then you can use the GNU sort and uniq command-line tools to get your desired counts, something like this:
text2triplet.pl <input.txt | sort | uniq -c | sort -rn | head -10000
(if the output of the first sort is very big, you may want to store it in a file rather than pipe it)
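The text2triplet.pl script in that pipeline is not shown; a minimal sketch might look like the following. It reuses the split-and-slide logic of the original script, but prints each triplet instead of counting it, so memory stays constant. The triplets() helper name is my own invention for readability:

```perl
#!/usr/bin/perl
# text2triplet.pl - emit every consecutive 3-word group, one per line,
# so that "sort | uniq -c" can do the counting on disk.
use strict;
use warnings;

# Return every consecutive 3-word group in a line, lowercased.
sub triplets {
    my ($line) = @_;
    my @words = grep { length } map lc, split /\W+/, $line;
    my @out;
    push @out, "@words[ $_ .. $_ + 2 ]" for 0 .. $#words - 2;
    return @out;
}

# One triplet per output line; memory use is bounded by the longest line.
while ( my $line = <> ) {
    chomp $line;
    print "$_\n" for triplets($line);
}
```

Because the script never stores more than one line's worth of words, it scales to arbitrarily large input files; all the heavy lifting is left to sort, which spills to temporary files on disk when needed.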
Database format: use DBD::SQLite and create table like this:
CREATE TABLE hash (triplet VARCHAR, count INTEGER DEFAULT 0);
CREATE INDEX idx1 ON hash (triplet);
CREATE INDEX idx2 ON hash (count);
INSERT your triplets into that table as you go, increasing count for duplicates. After the data is processed, simply
SELECT * FROM hash
WHERE count > 20
ORDER BY count DESC
and print it out.
Then you can DROP your hash table, or simply delete the whole SQLite database file.
Both of these approaches should let you scale to almost the size of your disk, but the database approach may be more flexible.
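A minimal sketch of the database approach, assuming DBD::SQLite is installed. Two deviations from the schema above: triplet is made the PRIMARY KEY, because the upsert needs a uniqueness constraint to conflict on (this also replaces idx1), and the INSERT ... ON CONFLICT syntax requires SQLite 3.24 or newer:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory database for the sketch; use a filename for real data sets.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, AutoCommit => 0 } );

$dbh->do("CREATE TABLE hash (triplet VARCHAR PRIMARY KEY, count INTEGER DEFAULT 0)");
$dbh->do("CREATE INDEX idx2 ON hash (count)");

# Upsert: insert a new triplet, or bump its count if it already exists.
my $upsert = $dbh->prepare(
    "INSERT INTO hash (triplet, count) VALUES (?, 1)
     ON CONFLICT(triplet) DO UPDATE SET count = count + 1"
);

while ( my $line = <> ) {
    chomp $line;
    my @words = grep { length } map lc, split /\W+/, $line;
    while ( @words >= 3 ) {
        $upsert->execute("@words[0..2]");
        shift @words;
    }
}
$dbh->commit;   # one big transaction is much faster than per-row commits

my $rows = $dbh->selectall_arrayref(
    "SELECT triplet, count FROM hash WHERE count > 20 ORDER BY count DESC"
);
print "$_->[0] - $_->[1]\n" for @$rows;
```

Batching all inserts into a single transaction (AutoCommit => 0 plus one commit at the end) matters a lot here; committing after every row would make SQLite sync to disk millions of times.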
Upvotes: 3
Reputation: 3465
You have some problems with how variables are declared and used. Please add the pragma use strict
to your script, and declare the loop variable with my when you iterate over the hash in the for block.
I also notice that you have the statement if($c{$key} > 20)
, but for your sample data all hash values are <= 2, so nothing would be printed.
#!/usr/bin/perl
use strict;
use warnings;

my %frequency;
while (my $line = <DATA>) {
    chomp $line;
    my @words = map lc, split /\W+/, $line;
    while (@words >= 3) {   # >= 3, so the final group on each line is counted too
        $frequency{"@words[0,1,2]"}++;
        shift @words;
    }
}

# sort by values
for my $key (sort { $frequency{$b} <=> $frequency{$a} } keys %frequency) {
    printf "%s - %s\n", $key, $frequency{$key};
}

__DATA__
This is a good time to party because this is a vacation time.
OUTPUT
this is a - 2
to party because - 1
is a good - 1
time to party - 1
party because this - 1
because this is - 1
good time to - 1
is a vacation - 1
a good time - 1
a vacation time - 1
Upvotes: 3