snowy_squirrel

Reputation: 27

Calculate mean value of column if multiple occurrences in another column

This is part of a tab separated file that I generated:

Sample      Gene    RPKM
MK_001_27   HPSE2   17.3767266340978
MK_003_11   HPSE2   51.1152373602497
MK_002_5    HPSE2   16.5372913024845
MK_001_23   HPSE2   25.8985857481585
MK_001_23   HPSE2   21.6045197173757
MK_001_27   HPSE2   139.450963428357
MK_001_23   HPSE2   36.7603351866179
MK_003_9    HPSE2   25.2860867858956
MK_001_22   HPSE2   100.915250867745
MK_003_9    HPSE2   35.078964327254
MK_003_12   HPSE2   34.078813048573
MK_003_9    HPSE2   13.5865540939141

Any gene is present in at least one sample. It may be present multiple times in the same sample.

I want to generate the mean of the Reads Per Kilobase per Million (RPKM) values and use it to replace the multiple values for any given gene from the same sample.

For example

MK_003_9    HPSE2   35.078964327254
MK_003_9    HPSE2   13.5865540939141
MK_003_9    HPSE2   25.2860867858956

becomes

MK_003_9    HPSE2   24.650535069

Is there a way that you know of that could resolve this in Unix or in Perl?

Pseudocode

-For each line, combine column 0 (sample) and column 1 (gene) to create a "key" (not always unique)
-Run through the file to see if this "key" is present somewhere else
-Count the number of times this "key" is present
-If present more than once, calculate the mean of the RPKM values as sum()/count
-Create one occurrence with this "key" and the new RPKM value
-Delete(?) the other corresponding "keys"
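Since Unix is an option: the pseudocode above collapses to a single awk pass if you accumulate a running sum and count per sample+gene key instead of re-scanning the file for each key. A minimal sketch, assuming the data lives in a file I'll call rpkm.tsv (the filename is a placeholder):

```shell
# One pass: build a "sample<TAB>gene" key per line, accumulate sum and
# count per key, then emit one averaged line per key at end-of-file.
awk 'NR == 1 { print; next }                        # copy the header
     { key = $1 "\t" $2; sum[key] += $3; n[key]++ }
     END { for (k in sum) printf "%s\t%.3f\n", k, sum[k] / n[k] }' rpkm.tsv
```

Note that awk's `for (k in sum)` iterates in an unspecified order, so pipe the result through sort if ordering matters.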

Upvotes: 0

Views: 119

Answers (3)

Borodin

Reputation: 126742

This is quite straightforward if you accumulate the contents of your data file into a Perl hash. You don't make it clear whether there may be multiple genes in a single file, so I've coded this for multiple samples and multiple genes

The program expects the input file as a parameter on the command line, and prints its output to STDOUT which may be redirected as normal

use strict;
use warnings 'all';

use List::Util 'sum';

print scalar <>;  # Copy the header line

my %mean_rpkm;

while ( <> ) {
    my ($sample, $gene, $rpkm) = split;
    push @{ $mean_rpkm{$gene}{$sample} }, $rpkm;
}

for my $gene ( sort keys %mean_rpkm ) {

    for my $sample ( sort keys %{ $mean_rpkm{$gene} } ) {
        my $rpkm = $mean_rpkm{$gene}{$sample};
        my $mean = sum(@$rpkm) / @$rpkm;
        printf "%s\t%s\t%.3f\n", $sample, $gene, $mean;
    }
}

output

Sample      Gene    RPKM
MK_001_22   HPSE2   100.915
MK_001_23   HPSE2   28.088
MK_001_27   HPSE2   78.414
MK_002_5    HPSE2   16.537
MK_003_11   HPSE2   51.115
MK_003_12   HPSE2   34.079
MK_003_9    HPSE2   24.651



Update

The output from my solution is essentially unordered. I have sorted the genes and samples lexically, but you may want them in the same order as they appear in the input file. If so then you should say so

A useful intermediate solution is to install Sort::Naturally (it's not a core module) and add

use Sort::Naturally 'nsort';

to the top of the above program. If you then replace both occurrences of sort with nsort then you will get this output. It may not be ideal, but it's an improvement because it sorts MK_003_9 before MK_003_11, which a simple lexical sort doesn't do

Sample      Gene    RPKM
MK_001_22   HPSE2   100.915
MK_001_23   HPSE2   28.088
MK_001_27   HPSE2   78.414
MK_002_5    HPSE2   16.537
MK_003_9    HPSE2   24.651
MK_003_11   HPSE2   51.115
MK_003_12   HPSE2   34.079

Upvotes: 3

Sobrique

Reputation: 53498

The core problem here is - if you have a condition like 'if there's multiple, delete the repeats' - you don't know if that condition applies until you've parsed the whole file.

You can read it all in, do some calculations (that are idempotent) and then dump some output. A bit like this:

#!/usr/bin/perl

use strict;
use warnings;

my %stuff;

#iterate line by line of the special 'DATA' filehandle. 
#(You probably want <> instead)
while (<DATA>) {
   #split on any whitespace. 
   my ( $sample, $gene, $RPKM ) = split;
   #stuff the values into a list. 
   push( @{ $stuff{$sample}{$gene} }, $RPKM );
}

#iterate processed results and print them
foreach my $sample ( sort keys %stuff ) {
   foreach my $gene ( sort keys %{ $stuff{$sample} } ) {
      #sum and divide for average. 
      my $sum = 0;
      $sum += $_ for @{ $stuff{$sample}{$gene} };
      print join "\t", $sample, $gene, $sum / @{ $stuff{$sample}{$gene} },
        "\n";
   }
}

__DATA__
MK_001_27   HPSE2   17.3767266340978
MK_003_11   HPSE2   51.1152373602497
MK_002_5    HPSE2   16.5372913024845
MK_001_23   HPSE2   25.8985857481585
MK_001_23   HPSE2   21.6045197173757
MK_001_27   HPSE2   139.450963428357
MK_001_23   HPSE2   36.7603351866179
MK_003_9    HPSE2   25.2860867858956
MK_001_22   HPSE2   100.915250867745
MK_003_9    HPSE2   35.078964327254
MK_003_12   HPSE2   34.078813048573
MK_003_9    HPSE2   13.5865540939141

This gives:

MK_001_22   HPSE2   100.915250867745    
MK_001_23   HPSE2   28.0878135507174    
MK_001_27   HPSE2   78.4138450312274    
MK_002_5    HPSE2   16.5372913024845    
MK_003_11   HPSE2   51.1152373602497    
MK_003_12   HPSE2   34.078813048573 
MK_003_9    HPSE2   24.6505350690212    

Note - it's sorted lexically, because hashes are intrinsically unordered. If you need to maintain ordering, it's a bit more complicated, but not impossible.
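The usual trick for preserving input order is to record each key the first time it's seen and iterate that list at the end (in Perl you'd keep an array of keys alongside the hash; sketched here in awk for brevity, with rpkm.tsv as a placeholder filename):

```shell
# Record the first-seen order of each sample/gene key, then report
# the means in that original input order rather than hash order.
awk 'NR == 1 { print; next }
     { key = $1 "\t" $2
       if (!(key in sum)) order[++n] = key   # remember first appearance
       sum[key] += $3; cnt[key]++ }
     END { for (i = 1; i <= n; i++)
             printf "%s\t%.3f\n", order[i], sum[order[i]] / cnt[order[i]] }' rpkm.tsv
```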

Upvotes: 1

choroba

Reputation: 241968

Use List::Util to get the sum function. Store the RPKMs in a hash of arrays, the key being the sample. At the end, sum the arrays and divide by numbers of their elements:

perl -MList::Util=sum -lane '
    next if 1 == $.;
    push @{ $h{ $F[0] } }, $F[2];
    }{
    print $_, "\t", sum(@{ $h{$_} }) / @{ $h{$_} }
        for sort keys %h;
' < input-file > output-file
  • -n reads the input line by line.
  • -a splits each line on whitespace into the @F array.
  • -l adds newlines to prints.
  • $. is the line number; it equals 1 for the header line.

Upvotes: 1
