Reputation: 27
This is part of a tab-separated file that I generated:
Sample Gene RPKM
MK_001_27 HPSE2 17.3767266340978
MK_003_11 HPSE2 51.1152373602497
MK_002_5 HPSE2 16.5372913024845
MK_001_23 HPSE2 25.8985857481585
MK_001_23 HPSE2 21.6045197173757
MK_001_27 HPSE2 139.450963428357
MK_001_23 HPSE2 36.7603351866179
MK_003_9 HPSE2 25.2860867858956
MK_001_22 HPSE2 100.915250867745
MK_003_9 HPSE2 35.078964327254
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 13.5865540939141
Every gene is present in at least one sample, and it may be present multiple times in the same sample.
I want to calculate the mean of the Reads Per Kilobase per Million (RPKM) values and use that mean to replace the multiple values for any given gene from the same sample.
For example
MK_003_9 HPSE2 35.078964327254
MK_003_9 HPSE2 13.5865540939141
MK_003_9 HPSE2 25.2860867858956
becomes
MK_003_9 HPSE2 24.650535069
Is there a way that you know of that could resolve this in Unix or in Perl?
Pseudocode (a rough Perl sketch follows the list)
- For each line, combine column 0 (sample) and column 1 (gene) to create a "key", though the key is not always unique
- Run through the file to see if this "key" is present somewhere else
- Count the number of times this "key" is present
- If present more than once, calculate the mean of the RPKM values as sum()/count
- Create one occurrence with this "key" and the new RPKM value
- Delete(?) the other corresponding "keys"
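Something along the lines of this rough, untested Perl sketch is what I am imagining; the tab-joined "sample\tgene" key is just my illustration:

use strict;
use warnings;
use List::Util 'sum';

my %rpkm;           # "sample\tgene" => list of RPKM values

print scalar <>;    # pass the header line through

while (<>) {
    my ($sample, $gene, $value) = split;
    push @{ $rpkm{"$sample\t$gene"} }, $value;    # duplicates collect under one key
}

for my $key (sort keys %rpkm) {
    my @values = @{ $rpkm{$key} };
    print join( "\t", $key, sum(@values) / @values ), "\n";    # mean replaces the duplicates
}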
Upvotes: 0
Views: 119
Reputation: 126742
This is quite straightforward if you accumulate the contents of your data file into a Perl hash. You don't make it clear whether there may be multiple genes in a single file, so I've coded this for multiple samples and multiple genes.
The program expects the input file as a parameter on the command line, and prints its output to STDOUT, which may be redirected as normal.
use strict;
use warnings 'all';

use List::Util 'sum';

print scalar <>;    # Copy the header line

my %mean_rpkm;

while ( <> ) {
    my ($sample, $gene, $rpkm) = split;
    push @{ $mean_rpkm{$gene}{$sample} }, $rpkm;    # Collect all values for this gene/sample pair
}

for my $gene ( sort keys %mean_rpkm ) {

    for my $sample ( sort keys %{ $mean_rpkm{$gene} } ) {

        my $rpkm = $mean_rpkm{$gene}{$sample};
        my $mean = sum(@$rpkm) / @$rpkm;    # Mean of all values for the pair

        printf "%s\t%s\t%.3f\n", $sample, $gene, $mean;
    }
}
Given your sample data, the output is:
Sample Gene RPKM
MK_001_22 HPSE2 100.915
MK_001_23 HPSE2 28.088
MK_001_27 HPSE2 78.414
MK_002_5 HPSE2 16.537
MK_003_11 HPSE2 51.115
MK_003_12 HPSE2 34.079
MK_003_9 HPSE2 24.651
The output from my solution is essentially unordered. I have sorted the genes and samples lexically, but you may want them in the same order as they appear in the input file. If so then you should say so.
A useful intermediate solution is to install Sort::Naturally (it's not a core module) and add
use Sort::Naturally 'nsort';
to the top of the above program. If you then replace both occurrences of sort with nsort, as sketched below, you will get the output that follows. It may not be ideal, but it's an improvement because it sorts MK_003_9 before MK_003_11, which a simple lexical sort doesn't do.
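For concreteness, here are the two loop headers with that substitution made; everything else in the program is unchanged:

use Sort::Naturally 'nsort';

for my $gene ( nsort keys %mean_rpkm ) {
    for my $sample ( nsort keys %{ $mean_rpkm{$gene} } ) {
        ...    # loop body exactly as before
    }
}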
Sample Gene RPKM
MK_001_22 HPSE2 100.915
MK_001_23 HPSE2 28.088
MK_001_27 HPSE2 78.414
MK_002_5 HPSE2 16.537
MK_003_9 HPSE2 24.651
MK_003_11 HPSE2 51.115
MK_003_12 HPSE2 34.079
Upvotes: 3
Reputation: 53498
The core problem here is that, with a condition like 'if there are multiple entries, delete the repeats', you don't know whether that condition applies until you've parsed the whole file.
So you can read it all in, do some calculations (that are idempotent) and then dump some output. A bit like this:
#!/usr/bin/perl
use strict;
use warnings;

my %stuff;

# Iterate line by line over the special 'DATA' filehandle.
# (You probably want <> instead.)
while (<DATA>) {
    # Split on any whitespace.
    my ( $sample, $gene, $RPKM ) = split;
    # Stuff the values into a list.
    push( @{ $stuff{$sample}{$gene} }, $RPKM );
}

# Iterate over the processed results and print them.
foreach my $sample ( sort keys %stuff ) {
    foreach my $gene ( sort keys %{ $stuff{$sample} } ) {
        # Sum and divide for the average.
        my $sum = 0;
        $sum += $_ for @{ $stuff{$sample}{$gene} };
        print join( "\t", $sample, $gene, $sum / @{ $stuff{$sample}{$gene} } ),
            "\n";
    }
}
__DATA__
MK_001_27 HPSE2 17.3767266340978
MK_003_11 HPSE2 51.1152373602497
MK_002_5 HPSE2 16.5372913024845
MK_001_23 HPSE2 25.8985857481585
MK_001_23 HPSE2 21.6045197173757
MK_001_27 HPSE2 139.450963428357
MK_001_23 HPSE2 36.7603351866179
MK_003_9 HPSE2 25.2860867858956
MK_001_22 HPSE2 100.915250867745
MK_003_9 HPSE2 35.078964327254
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 13.5865540939141
This gives:
MK_001_22 HPSE2 100.915250867745
MK_001_23 HPSE2 28.0878135507174
MK_001_27 HPSE2 78.4138450312274
MK_002_5 HPSE2 16.5372913024845
MK_003_11 HPSE2 51.1152373602497
MK_003_12 HPSE2 34.078813048573
MK_003_9 HPSE2 24.6505350690212
Note: it's sorted alphanumerically, because hashes are intrinsically unordered. If you need to maintain ordering, it's a bit more complicated, but not impossible; see the sketch below.
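One way to keep the input order, as a sketch (not part of the original answer): remember each sample/gene pair the first time it is seen, and iterate over that list instead of sorting the hash keys:

my %stuff;
my @order;    # sample/gene pairs in first-seen order

while (<DATA>) {
    my ( $sample, $gene, $RPKM ) = split;
    push @order, [ $sample, $gene ]
        unless $stuff{$sample}{$gene};    # record first appearance only
    push @{ $stuff{$sample}{$gene} }, $RPKM;
}

for my $pair (@order) {
    my ( $sample, $gene ) = @$pair;
    my $sum = 0;
    $sum += $_ for @{ $stuff{$sample}{$gene} };
    print join( "\t", $sample, $gene, $sum / @{ $stuff{$sample}{$gene} } ), "\n";
}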
Upvotes: 1
Reputation: 241968
Use List::Util to get the sum function. Store the RPKMs in a hash of arrays, the key being the sample. At the end, sum the arrays and divide by the number of their elements:
perl -MList::Util=sum -lane '
next if 1 == $.;
push @{ $h{ $F[0] } }, $F[2];
}{
print $_, "\t", sum(@{ $h{$_} }) / @{ $h{$_} }
for sort keys %h;
' < input-file > output-file
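The }{ near the end closes the implicit while loop that -n wraps around the code and opens a bare block that runs once after all input has been read. Expanded into an ordinary script, the one-liner is roughly equivalent to this sketch (not part of the original answer):

use strict;
use warnings;
use List::Util 'sum';

my %h;
while (<>) {              # supplied by -n
    next if 1 == $.;      # skip the header line
    my @F = split;        # supplied by -a
    push @{ $h{ $F[0] } }, $F[2];
}

# runs once, after the whole file has been read
print $_, "\t", sum( @{ $h{$_} } ) / @{ $h{$_} }, "\n"
    for sort keys %h;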
-n reads the input line by line.
-a splits each line on whitespace into the @F array.
-l adds newlines to prints.
$. is the line number; it equals 1 for the header line.
Upvotes: 1