Reputation: 21
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
A_1 18.66530716 0 10.45969216 52.71893547 40.04726048 32.16758825 38.27754435
A_2 25.19816467 0 12.5516306 37.95763354 28.39714834 25.7340706 37.581589
A_3 61.5006053 0 6.807664053 4.57493135 23.69514333 9.304974679 29.44245014
A_4 46.71317515 4.988346264 21.47872616 36.08568845 7.47600779 18.34871344 75.02919728
A_5 38.12488272 0 0 28.71499464 19.82997811 19.46785483 66.33787183
A_6 44.16019386 3.313750449 10.70121259 38.35466425 8.691025042 13.40792311 42.72152213
B_1 38.39720331 13.32601073 0 19.28006783 9.985810405 9.803455466 95.44530538
B_2 46.53021582 1.899838598 24.54086634 13.74342921 24.20186228 6.988206544 47.62545788
B_3 48.42890507 0 6.0308135 20.26433556 20.99119304 10.30393217 64.20344867
A_7 32.10687649 0 20.56239825 23.03079775 9.542753971 10.5395511 44.46513374
B_4 34.82673166 0 6.122746633 39.08916191 8.524472297 14.64540603 54.99744731
B_5 32.49685303 2.910517165 15.66506159 35.79294964 8.723952928 10.7058016 52.11522135
B_6 30.38974634 0 0 30.51870034 10.53778987 17.24225836 50.36058827
B_7 59.60856159 0 8.097826192 19.0468412 2.818575518 11.06841746 10.77608287
A_8 36.07790915 6.260541956 0 31.70212496 14.07396097 4.605650219 67.26011453
C_1 0 17.27445836 0 382.0309737 1.849224149 0 0
C_2 344.0389416 119.4010562 32.13217433 0 22.36821531 285.4766232 21.37974841
C_3 235.5547989 37.86357293 22.23167043 2.490045661 2.579360621 30.38709443 14.79226135
C_4 0 2.801263518 0 334.3615367 0 0 0
C_5 9.397916894 128.2900334 187.2504332 25.16745451 22.81140838 14.39668285 0
Here is the data matrix. Each row is a variable and each column is a sample ID.
A_1 to A_8 belong to clusterA, B_1 to B_7 to clusterB, and C_1 to C_5 to clusterC.
I want to calculate the mean or median of A_1 to A_8 for each sample and use it as the value of clusterA. For the median, the expected result is:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.10139593 0 10.58045238 33.89390671 16.95196954 15.87831827 43.59332793
Could anyone help me solve this problem with a Perl script?
Upvotes: 1
Views: 485
Reputation: 40778
Here is an example of how you can calculate the medians of the clusters:
use feature qw(say);
use strict;
use warnings;

my $fn = 'data.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %clusters;
while (my $line = <$fh>) {
    chomp $line;
    my ($id, @cols) = split " ", $line;
    die "Bad format" if !@cols;
    if ( $id =~ /^([A-Za-z]+)_/ ) {
        $id = $1;
    }
    else {
        die "Bad ID";
    }
    if (!exists $clusters{$id} ) {
        $clusters{$id} = [];
    }
    my $samples = $clusters{$id};
    for my $i (0..$#cols) {
        push @{ $samples->[$i] }, $cols[$i];
    }
}
close $fh;

print $header;
for my $id (sort keys %clusters) {
    my $samples = $clusters{$id};
    my @items;
    push @items, sprintf "cluster%s", $id;
    for my $sample (@$samples) {
        my $median = calculate_median( $sample );
        push @items, $median;
    }
    say join "\t", @items;
}

sub calculate_median {
    my ( $sample ) = @_;
    my @sorted = sort {$a <=> $b} @$sample;
    my $N = scalar @sorted;
    my $i = int ($N/2);
    if ( $N % 2 == 0 ) {
        my $m1 = $sorted[$i-1];
        my $m2 = $sorted[$i];
        return ($m1 + $m2)/2;
    }
    else {
        return $sorted[$i];
    }
}
Output:
Var_ID sample1 sample2 sample3 sample4 sample5 sample6 sample7
clusterA 37.101395935 0 10.580452375 33.893906705 16.95196954 15.878318275 43.593327935
clusterB 38.39720331 0 6.122746633 20.26433556 9.985810405 10.7058016 52.11522135
clusterC 9.397916894 37.86357293 22.23167043 25.16745451 2.579360621 14.39668285 0
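The question also asks about the mean. A minimal sketch of a mean helper that takes the same array reference per sample as calculate_median is shown below. The name calculate_mean is hypothetical and not part of the script above; it assumes every sample list is non-empty and purely numeric.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: arithmetic mean of an array ref of numbers.
sub calculate_mean {
    my ( $sample ) = @_;
    die "Empty sample" if !@$sample;
    return sum(@$sample) / scalar @$sample;
}

# sample2 values of A_1 .. A_8, copied from the question's data
my @a_sample2 = (0, 0, 0, 4.988346264, 0, 3.313750449, 0, 6.260541956);
printf "%.9f\n", calculate_mean( \@a_sample2 );    # prints 1.820329834
```

Swapping the call to calculate_median in the output loop for calculate_mean would, under these assumptions, print cluster means instead of medians.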
Upvotes: -1
Reputation:
Calculate both mean and median:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use List::Util qw(sum);
use POSIX qw(floor ceil);

my %data = ();
my %avg = ();
my %median = ();
while (<>) {
    next if $. == 1;    # skip the header line
    my @fields = split;
    my $cluster = substr($fields[0],0,1);    # cluster letter: A, B or C
    $data{$cluster} = [] unless exists($data{$cluster});
    push @{$data{$cluster}}, [ @fields[1..$#fields] ];
}
for my $cluster (keys(%data)) {
    for my $sampleNo (0..scalar(@{$data{$cluster}[0]})-1) {
        my @samples = map { $_->[$sampleNo] } @{$data{$cluster}};
        my $cnt = @samples;
        $avg{$cluster}[$sampleNo] = sum(@samples)/$cnt;
        # sort numerically; a bare sort would compare the values as strings
        my @sorted = sort { $a <=> $b } @samples;
        $median{$cluster}[$sampleNo] = ($sorted[floor(($cnt+1)/2)-1] +
                                        $sorted[ceil(($cnt+1)/2)-1])/2;
    }
}
print "Mean\n";
for my $cluster (sort keys (%data)) {
    print join("\t", ($cluster,map {sprintf "%15.9f",$_ } @{$avg{$cluster}})),"\n";
}
print "Median\n";
for my $cluster (sort keys (%data)) {
    print join("\t", ($cluster,map {sprintf "%15.9f",$_ } @{$median{$cluster}})),"\n";
}
Output:
perl test.pl <sample.txt
Mean
A 37.818389312 1.820329834 10.320165477 31.642471301 18.969159754 16.697040778 50.139427875
B 41.525459546 2.590909499 8.636759179 25.390783670 12.254808048 11.536782519 53.646221676
C 117.798331479 61.126076882 48.322855592 148.810002114 9.921641692 66.052080096 7.234401952
Median
A 37.101395935 0.000000000 10.580452375 33.893906705 16.951969540 15.878318275 43.593327935
B 38.397203310 0.000000000 6.122746633 20.264335560 9.985810405 10.705801600 52.115221350
C 9.397916894 37.863572930 22.231670430 25.167454510 2.579360621 14.396682850 0.000000000
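One detail worth checking in any Perl median code is the sort comparator: a bare sort compares its arguments as strings, while sort { $a <=> $b } compares them as numbers. The sketch below uses the sample3 values of A_1 to A_8 from the question to show how the two orderings differ; the variable names are illustrative only.

```perl
use strict;
use warnings;

# sample3 values of A_1 .. A_8, copied from the question's data
my @samples = (10.45969216, 12.5516306, 6.807664053, 21.47872616,
               0, 10.70121259, 20.56239825, 0);

my @as_strings = sort @samples;                  # string comparison
my @as_numbers = sort { $a <=> $b } @samples;    # numeric comparison

# With string comparison, 6.807664053 sorts after 21.47872616 because
# the character "6" is greater than "2", so it ends up last.
print "string : @as_strings\n";
print "numeric: @as_numbers\n";
```

Because the median is read from the middle of the sorted list, a string ordering can move the middle elements and change the result whenever the values differ in their number of integer digits.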
Upvotes: 2