MyFirstName MyLastName
MyFirstName MyLastName

Reputation: 207

aggregate totals when key changes in Perl

I have an input file with the following format

ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4

I need to aggregate the col2 for each key (column1) so the result is:

ant,1
bat,3
cat,7
dog,4

Other considerations:

  1. Assume that the input file is sorted
  2. The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
  3. Each input line should be processed as we read it, and move to the next line
  4. I need to write the results to an outFile
  5. I need to do this in Perl, but a pseudo-code or algorithm would help just as fine

Thanks!

This is what I came up with... want to see if this can be written better/elegant.

open infile, outFile

prev_line = <infile>;
print_line = $prev_line;

while(<>){
   curr_line = $_;

   @prev_cols=split(',', $prev_line);
   @curr_cols=split(',', $curr_line);

   if ( $prev_cols[0] eq $curr_cols[0] ){
      $prev_cols[1] += curr_cols[1];
      $print_line = "$prev_cols[0],$prev_cols[1]\n";
      $print_flag = 0;
   }
   else{
      $print outFile "$print_line";
      $print_flag = 1;
      $print_line = $curr_line;
   }
   $prev_line = $curr_line;
}

if($print_flag = 1){
   print outFile "$curr_line";
}   
else{
   print outFile "$print_line";
}

Upvotes: 2

Views: 552

Answers (3)

Greg Bacon
Greg Bacon

Reputation: 139471

Use Perl’s awk mode.

  • -a

turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.

perl -ane 'print pop(@F), "\n";'

is equivalent to

while (<>) {
  @F = split(' ');
  print pop(@F), "\n";
}

An alternate delimiter may be specified using -F.

All that’s left for you is to accumulate the sums in a hash and print them.

$ perl -F, -lane '$s{$F[0]} += $F[1];
    END { print "$_,$s{$_}" for sort keys %s }' input

Output:

ant,1
bat,3
cat,7
dog,4

Upvotes: 1

thb
thb

Reputation: 14454

#!/usr/bin/perl
use warnings;
use strict;
use integer;

my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_,${a{$_}}\n" for sort keys %a;

This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.

Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.

On the other hand, it may happen that not only the number of data but the number of animals is so large that one would not like to store the hash %a. In this case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. In this case, something like the following might suit (though regrettably it is not nearly so neat as the above).

#!/usr/bin/perl
use warnings;
use strict;
use integer;

my $last_animal = undef;
my $total_for_the_last_animal = 0;

sub start_new_animal ($$) {
    my $next_animal = shift;
    my $n = shift;
    print "$last_animal,$total_for_the_last_animal\n"
      if defined $last_animal;
    $last_animal = $next_animal;
    $total_for_the_last_animal = $n;
}

while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    if (
        defined($n) && defined($animal) && defined($last_animal)
          && $animal eq $last_animal
    ) { $total_for_the_last_animal += $n; }
    else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;

Upvotes: 2

starbolin
starbolin

Reputation: 840

It's trivial in perl. Loop on the file input. Split the input line on comma. For each key in column one keep a hash to which you add the value in column two. At the end of the file print the list of hash keys and their values. It can be done in one line but that would obfuscate the algorithm.

Upvotes: 0

Related Questions