Reputation: 1
I have a CSV table where I have the merged data for 1024 independent variables and 25 dependent variables that are associated with them. For each independent variable (called 1 .. 1024), I have 10 different outcomes. I would like to
It seems like a fairly easy thing to ask of Perl, and maybe it would be simple to do with a hash of arrays of arrays, but I'm still unsure how to implement something like that for this data set.
I found a very helpful Q&A from 2009 on printing matching lines. It works fairly well after some tinkering, but a few issues remain:
I'm fairly sure there must be an easier way to do this, and I would greatly appreciate any help and/or constructive criticism on my (ripped-off) script.
Thank you!
This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;

unless ($#ARGV == 0) {
    print "USAGE: get_best.pl csvfile \n";
    exit;
}

### this is a script to get the best "score"
my $input = $ARGV[0];
my $outfile = "bestofthebest.csv";
if (-e $outfile ) {
    system "rm $outfile";
}

open(my $fh,'<',"$input") || die "could not open $input"; #try to open input
open (SUMMARY, ">>","$outfile") || die "could not open $outfile"; #open output file for writing

my $this_line = "";
my $do_next = 0;
while (<$fh>) {
    chomp($_);
    my $last_line = $this_line;
    $this_line = $_;
    if ($this_line =~ m/Seq/) {
        print SUMMARY "$this_line\n";
        next;
    }
    my ($compound, $rank, $nnme, $G1, ..., $res1, $res2, $res3, $res4, $res5, $res6 ) = split(/\s+/, $this_line, 26);
    my ($compound_old, $rank_old, $nnme_old, $G1_old, ..., $res1_old, $res2_old, $res3_old, $res4_old, $res5_old, $res6_old) = split(/\s+/, $last_line, 26);
    foreach ($compound == $compound_old) {
        if (($G1 >= $G1_old)){
            print SUMMARY "$this_line\n";
            print "\n $G1 G1 is >> $G1_old G1_old loop\n";
            print "\n compound is $compound G1 is $G1\n";
            $do_next = 1;
        }
        else {
            $last_line = "";
            $do_next = 0;
        }
    }
}
close ($fh);
close (SUMMARY);
This is what the input data looks like (I've left off some columns and rows, obviously):
10 8 3 -18.08 -1.4 -16.68 -15.94 -2.13 -9.45
11 10 4 -15.2 3.2 -18.4 -18.02 2.82 -5
11 5 4 -15.22 2.71 -17.92 -15.88 0.66 -4.51
11 7 4 -14.06 3.84 -17.89 -16.7 2.64 -5.73
11 4 4 -16.63 0.48 -17.1 -15.75 -0.87 -5.92
11 6 4 -15.21 1.83 -17.04 -18.41 3.21 -7
11 9 4 -15.18 1.82 -17 -16.56 1.38 -7.09
11 8 4 -14.98 1.93 -16.91 -16.78 1.79 -10.81
11 2 4 -18.75 -1.95 -16.8 -17.83 -0.92 -7.35
11 1 4 -19.67 -3.17 -16.5 -16.4 -3.27 -9.01
11 3 4 -16.69 -0.54 -16.14 -16.35 -0.34 -9.17
12 7 4 -19.54 -1.14 -18.41 -17.74 -1.81 -2.79
12 9 4 -19.09 -1.01 -18.08 -16.01 -3.09 -5.56
12 4 4 -19.48 -2.18 -17.3 -16.34 -3.14 -4
12 2 4 -19.86 -2.77 -17.1 -15.97 -3.9 -2.96
12 8 4 -19.49 -2.45 -17.03 -16.39 -3.1 -7.19
12 1 4 -20.28 -3.33 -16.95 -17.12 -3.16 -5.18
12 3 4 -18.78 -1.93 -16.86 -17.81 -0.98 -5.39
12 5 4 -19.63 -2.86 -16.77 -16.41 -3.22 -6.54
12 6 4 -19.81 -3.25 -16.56 -16.53 -3.27 -7.19
12 10 4 -19.39 -2.95 -16.44 -17.42 -1.97 -7.67
13 1 3 -13.05 6.35 -19.4 -18.71 5.66 -6.43
13 8 3 -21.44 -2.32 -19.11 -17.08 -4.36 -1.93
13 3 3 -16 2.94 -18.94 -19.24 3.24 -2.78
13 2 3 -13.79 4.9 -18.7 -17.35 3.56 -4.72
13 6 3 -22.08 -3.4 -18.68 -20.12 -1.96 -6.74
13 9 3 -18.98 -0.32 -18.66 -15.97 -3.01 -3.06
13 7 3 -20.4 -2.08 -18.32 -18.24 -2.17 -5.71
13 5 3 -19.94 -1.62 -18.32 -19.42 -0.52 -7.44
13 10 3 -19.26 -1.25 -18.01 -17.52 -1.74 -5.68
13 4 3 -17.75 -1.33 -16.42 -17.75 0 -9.15
14 9 3 -22.23 -3.43 -18.79 -16.68 -5.55 -3.91
14 5 3 -21.32 -2.95 -18.37 -18.08 -3.24 -6.03
14 7 3 -24.25 -6.29 -17.96 -18.78 -5.47 -9.21
14 6 3 -21.03 -3.14 -17.89 -19.17 -1.86 -10.11
14 4 3 -21.59 -3.93 -17.67 -19.32 -2.28 -6.55
14 1 3 -22.43 -4.79 -17.63 -18.09 -4.34 -5.63
10 2 3 -10.11 8.94 -19.04 -18.48 8.38 -4.09
11 5 4 -15.22 2.71 -17.92 -15.88 0.66 -4.51
12 7 4 -19.54 -1.14 -18.41 -17.74 -1.81 -2.79
12 6 4 -19.81 -3.25 -16.56 -16.53 -3.27 -7.19
13 8 3 -21.44 -2.32 -19.11 -17.08 -4.36 -1.93
14 9 3 -22.23 -3.43 -18.79 -16.68 -5.55 -3.91
15 10 4 -21.51 -1.51 -20 -17.63 -3.88 -2.45
16 5 4 -17.81 2.56 -20.37 -19.09 1.28 -1.19
16 2 4 -16.61 1.97 -18.58 -21.06 4.45 -6.47
Upvotes: 0
Views: 114
Reputation: 6204
Perhaps the following will be helpful:
use strict;
use warnings;

my %hash;

while (<DATA>) {
    my ( $indVarID, $val ) = (split)[ 0, 3 ];
    $hash{$indVarID} = [ $val, $_ ]
        if !exists $hash{$indVarID}
        or $hash{$indVarID}[0] < $val;
}

print $hash{$_}[1] for sort { $a <=> $b } keys %hash;
__DATA__
11 7 4 -14.06 3.84 -17.89 -16.7 2.64 -5.73
11 4 4 -16.63 0.48 -17.1 -15.75 -0.87 -5.92
11 6 4 -15.21 1.83 -17.04 -18.41 3.21 -7
11 9 4 -15.18 1.82 -17 -16.56 1.38 -7.09
11 8 4 -14.98 1.93 -16.91 -16.78 1.79 -10.81
11 2 4 -18.75 -1.95 -16.8 -17.83 -0.92 -7.35
11 1 4 -19.67 -3.17 -16.5 -16.4 -3.27 -9.01
11 3 4 -16.69 -0.54 -16.14 -16.35 -0.34 -9.17
12 7 4 -19.54 -1.14 -18.41 -17.74 -1.81 -2.79
12 9 4 -19.09 -1.01 -18.08 -16.01 -3.09 -5.56
12 4 4 -19.48 -2.18 -17.3 -16.34 -3.14 -4
12 2 4 -19.86 -2.77 -17.1 -15.97 -3.9 -2.96
12 8 4 -19.49 -2.45 -17.03 -16.39 -3.1 -7.19
12 1 4 -20.28 -3.33 -16.95 -17.12 -3.16 -5.18
12 3 4 -18.78 -1.93 -16.86 -17.81 -0.98 -5.39
12 5 4 -19.63 -2.86 -16.77 -16.41 -3.22 -6.54
12 6 4 -19.81 -3.25 -16.56 -16.53 -3.27 -7.19
12 10 4 -19.39 -2.95 -16.44 -17.42 -1.97 -7.67
13 1 3 -13.05 6.35 -19.4 -18.71 5.66 -6.43
13 8 3 -21.44 -2.32 -19.11 -17.08 -4.36 -1.93
13 3 3 -16 2.94 -18.94 -19.24 3.24 -2.78
13 2 3 -13.79 4.9 -18.7 -17.35 3.56 -4.72
13 6 3 -22.08 -3.4 -18.68 -20.12 -1.96 -6.74
13 9 3 -18.98 -0.32 -18.66 -15.97 -3.01 -3.06
13 7 3 -20.4 -2.08 -18.32 -18.24 -2.17 -5.71
13 5 3 -19.94 -1.62 -18.32 -19.42 -0.52 -7.44
13 10 3 -19.26 -1.25 -18.01 -17.52 -1.74 -5.68
13 4 3 -17.75 -1.33 -16.42 -17.75 0 -9.15
14 9 3 -22.23 -3.43 -18.79 -16.68 -5.55 -3.91
14 5 3 -21.32 -2.95 -18.37 -18.08 -3.24 -6.03
14 7 3 -24.25 -6.29 -17.96 -18.78 -5.47 -9.21
14 6 3 -21.03 -3.14 -17.89 -19.17 -1.86 -10.11
14 4 3 -21.59 -3.93 -17.67 -19.32 -2.28 -6.55
14 1 3 -22.43 -4.79 -17.63 -18.09 -4.34 -5.63
Output:
11 7 4 -14.06 3.84 -17.89 -16.7 2.64 -5.73
12 3 4 -18.78 -1.93 -16.86 -17.81 -0.98 -5.39
13 1 3 -13.05 6.35 -19.4 -18.71 5.66 -6.43
14 6 3 -21.03 -3.14 -17.89 -19.17 -1.86 -10.11
This builds a hash of arrays (HoA), where the key is the independent variable ID and the value is a reference to a two-element array. Element 0 of that array is the value found in the record's fourth column. Element 1 is the record itself (the whole input line).
As records are being read, if a new value for an independent variable is greater than the stored value (or if there wasn't a stored one yet), the new value and record replace the stored pair. On a tie, the record seen first is kept.
When done, the keys are numerically sorted and the records which contained the greatest value for each independent variable ID are printed.
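If it helps to reuse the same idea on other columns, the logic can be wrapped in a small subroutine. This is only a sketch: the name best_records and its $col parameter are made up for illustration, and skipping lines that match /Seq/ assumes the header convention from the script in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): for each ID in column 0,
# return the record with the greatest value in column $col (0-based).
sub best_records {
    my ( $lines, $col ) = @_;
    my %best;    # ID => [ best value so far, full record ]
    for my $line (@$lines) {
        next if $line =~ /Seq/;          # assumed header marker, as in the question
        my @fields = split ' ', $line;   # split on runs of whitespace
        next unless @fields > $col;      # skip blank or short lines
        my ( $id, $val ) = @fields[ 0, $col ];
        $best{$id} = [ $val, $line ]
            if !exists $best{$id} or $best{$id}[0] < $val;
    }
    # Records in ascending numeric order of ID
    return map { $best{$_}[1] } sort { $a <=> $b } keys %best;
}

# A few rows taken from the sample data, truncated to six columns
my @sample = (
    "11 7 4 -14.06 3.84 -17.89",
    "11 1 4 -19.67 -3.17 -16.5",
    "12 1 4 -20.28 -3.33 -16.95",
    "12 3 4 -18.78 -1.93 -16.86",
);
print "$_\n" for best_records( \@sample, 3 );
```

With the four sample rows this prints the 11 7 and 12 3 records. To run it against a file instead, the lines could be read from a filehandle, chomped into an array, and passed in the same way.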
Upvotes: 1