Reputation: 43
I have data like this:
1 10
1 30
1 40
1 10
2 20
2 20
2 30
3 50
3 10
3 10
3 10
4 20
4 10
I would like to sum up the values in the second column whenever the values in the first column match; the result would look like this:
1 90
2 70
3 80
4 30
Here is my code:
while (<DATA>)
{
    my ($a, $b) = split;
    $hash{$a} += $b;
}

foreach $a (sort keys %hash)
{
    $b = $hash{$a};
    print OUT "$a $b\n";
}
It works with the sample data (around 100 MB), but it seems to take ages to deal with my real data (around 100 GB). Are there any ways to optimize my code?
Thanks in advance for any advice!
Upvotes: 3
Views: 228
Reputation: 26121
If your data looks like what you have shown us, it is already sorted by key, so a hash is not necessary at all.
perl -anE'if($k!=$F[0]){say"$k $s"if$.>1;$k=$F[$s=0]}$s+=$F[1]}{say"$k $s"'
will do the trick. I doubt it will be slow.
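For readability, here is roughly what that one-liner does when written out as a full script (my own expansion, with the implicit -a/-n loop made explicit):

use feature 'say';

my ($k, $s);
while (<>) {                     # -n: read the input line by line
    my @F = split ' ', $_;       # -a: autosplit the line into @F
    if ($k != $F[0]) {           # key changed (numeric compare, fine for integer keys)
        say "$k $s" if $. > 1;   # print the finished group (not before the first line)
        $k = $F[ $s = 0 ];       # reset the sum and remember the new key
    }
    $s += $F[1];
}
say "$k $s";                     # the trailing }{ prints the last group after the loop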
Upvotes: 1
Reputation: 57600
Hashes are quite efficient. They are probably the best solution to your problem. However, there could be exceptions, depending on your data:
If all keys are integers in a (more or less) continuous range, then you can use an array instead, which is even more efficient than a hash:
while (<DATA>) {
    my ($k, $v) = split;
    $array[$k] += $v;
}

for my $i (grep defined $array[$_], 0 .. $#array) {
    print "$i $array[$i]\n";
}
If the keys are already sorted, we don't need any intermediate data structure. Just accumulate the sum into a scalar. When the key changes, output the sum of the last key.
If you have multiple files, you can apply your algorithm to each of them in parallel and combine the results; with N workers the wall-clock time drops to roughly 1/N (a big win). Either split the large file into smaller chunks, or do some magic with seek and tell to partition the file. The more processors you can keep busy, the faster your file will be summarized. One caveat: it may very well be that I/O is your bottleneck. If this task has to be done regularly, using an SSD (instead of an HDD) might drastically improve performance.
Upvotes: 2
Reputation: 129393
As others stated, your most likely bottleneck isn't hashes or Perl, but disk access.
Split up the file into smaller chunks (using standard Unix utilities if you can).
Store them on SEPARATE I/O sources (different disks, ideally on different controllers, ideally on different machines).
If you have only a few distinct keys (e.g. more than 100-1000 rows per key), simply run the chunks separately, then concatenate the partial results into one roughly 100x smaller file and process that single file as a whole; a quick sketch of that second pass follows below.
Otherwise, synchronize the processing using a database to store sums.
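For the concatenate-and-re-sum option above, the second pass is just the same accumulation run over the partial results; a minimal sketch (the part-*.out file names are hypothetical):

use strict;
use warnings;

# Re-sum the per-chunk partial results, e.g.:  perl merge.pl part-*.out
my %total;
while (<>) {
    my ($k, $partial) = split;
    $total{$k} += $partial;
}
print "$_ $total{$_}\n" for sort { $a <=> $b } keys %total;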
Upvotes: 3