Reputation: 2157

sum different values per group

I have a tab-delimited file (with header line) in which I want to make the sum of some values based on the presence of data from another column.

This is how my table looks like:

 C1 C2  C3  C4
 a  x   e   1
 b  x       3
 c  y       1
 d  z   f   4

For all the identical values in C2 I want to make the sum of the corresponding values in C3. So the new table should look like this:

 C1    C2    C3    C4    C5
 a     x     e     1     4
 b     x           3     4
 c     y           1     1
 d     z     f     4     4

So I want to keep the original structure of my table (columns C1,C2,C3,C4 and identical number of lines), but at the end I want to add another column, where in this case for line a and b the values from C4 are summed up and put in C5, since the values for C2 are identical. The values in C5 for line c&d are the same as C4, since the y&z in C2 are unique.

Can anyone help me how to do this in Perl? I suppose I should loop through the file and create a variable $sum in which by going through each line he adds up the number of C4 and then puts in C5.

Upvotes: 0

Answers (4)

Lee Duhem

Reputation: 15121

Read the data file line by line, push every line to an array, and record the value of column 4 in a hash, using column 2 as key. When you finished, iterate that array, print its element and corresponding column 4 one by one.

#!/usr/bin/perl

use strict;
use warnings;

my $header = <DATA>;
chomp $header;
$header = join "\t", (split /\t/, $header), 'C5';

my (%record, @lines);

while (<DATA>) {
    chomp;
    my @cols = split /\t/;
    $record{$cols[1]} += $cols[3];
    push @lines, \@cols;
}

print "$header\n";
foreach (@lines) {
    print (join "\t", (@$_, $record{$_->[1]}), "\n");
}

__DATA__
 C1 C2  C3  C4
 a  x   e   1    
 b  x       3    
 c  y       1    
 d  z   f   4

Upvotes: 5

mpapec

Reputation: 50637

perl -lane'
  BEGIN { $, ="\t" }
  print(@F, "C4"),next if $. ==1;
  $s{ $F[1] } += $F[2];
  push @r, [ @F ];
  END {
    print @$_, $s{$_->[1]} for @r;
  }
' file

output

C1      C2      C3      C4
a       x       1       4
b       x       3       4
c       y       1       1
d       z       4       4

script version,

use strict;
use warnings;
local $, ="\t";
local $\ = "\n";

my %s;
my @r;
while (<>) {
  chomp;
  my @F = split;
  print(@F, "C4"),next if $. ==1;
  $s{ $F[1] } += $F[2];
  push @r, [ @F ];
}

print @$_, $s{$_->[1]} for @r;

Upvotes: 1

Adrian Frühwirth

Reputation: 45526

If you don't mind reading the file twice this can be achieved very easily using awk as well:

$ awk -v OFS='\t' 'NR==FNR{a[$2]+=$3;next}{print $1,$2,$3,$2=="C2"?"C4":a[$2]}' t.txt t.txt
C1      C2      C3      C4
a       x       1       4
b       x       3       4
c       y       1       1
d       z       4       4

and if you don't mind fixing the column header manually the following suffices:

$ awk -v OFS='\t' 'NR==FNR{a[$2]+=$3;next}{print $1,$2,$3,a[$2]}' t.txt t.txt
C1      C2      C3      0
a       x       1       4
b       x       3       4
c       y       1       1
d       z       4       4

Upvotes: 0

nikonan

Reputation: 166

One simple solution would be to put the values from C3 into an array and then print each line's contents up to C3. For column i of C4, compare the contents of array[i-1] to array[i] and array[i+1] to array[i] i.e. compare the value of the array of the particular line with that of the previous and of the next line. If they are equal, add them, print them and move to next line.

This can be easily altered for cases where C3 is equal for more than 2 consecutive values.

Upvotes: 0

sum different values per group

Answers (4)

Related Questions