Neal

Reputation: 338

Perl: Find maximum value of a hash and compute averages

After a big break of ~6 months I am back in the world of Perl and Bioinformatics, interning under a different scientist. But the very first assignment is unlike any I had encountered last time, so while I have made some progress, I haven't been able to tackle the problem in its entirety. I am also trying to revise whatever I learnt last time as fast as possible, because I completely lost touch with programming these last 6 months. The dataset looks like the following:

NR_046018   DDX11L1     ,   0   0   1   1   1   1   1   1   1   1   0   0   0   0   1.44    2.72    3.84    4.92    5.6 6.64    7.08    9.12    9.56    8.28    7.16    6.08    5.4 4.36    3.92    1.88    0   0   0.76    1   1   1   1.2 2   2   2   1.72    2   2   2   1.8 1   1.88    2.4 3   3.36    5   6   6   6.72    6.12    5.6 5.44    5.56    5   4.04    5   4.28    4   4   3.08    2.08    1.68    1.96    1.44    3   3.68    4   4.16    5   4.32    4.8 6.16    6   6.28    6.92    7.84    7   7.32    7.2 5.96    5   4.52    4.08    3   3   4.04    4.12    4.44    4   3.52    3.4 4   4   2.64    1.88    1   1   1   0.64    1   1   1.24    2   2.92    3   3   2.96    2   2   2.56    2   1.08    2.12    3   3   3   3   2.6 3   4.64    3.88    3.72    4   4   4.96    4.6 4   2.36    2   1.28    1   1   0.04    0   0.24    1.08    2.68    3.84    4.12    5.72    6   6   5.76    4.92    3.32    3.12    2.88    2.08    2   2   2   2   2   1.44    2.92    3.04    4.28    5.8 7.8 9.48    10.52   13.04   12.08   11.6    11.72   11  9.2 7.52    7.12    7.08    7.08    8.32    7   6.6 7.6 8.04    8.36    6.72    7.88    7.72    8.4 9.24    8.88    8.96    9.88    10.08   9.24    9.28    10.16   11.04   10.52   10  8.56    8   7.8 7.72    6.44    4.32    4   4   3.72    3.68    3.68    3.28    5.56    7.36    9.48    10  10.52   11  12.16   11.96   9.44    8.64    7.52    7   6.48    6   5   5.12    6.28    6   5.52    6   6.68    6.08    7.52    8.16    7.72    8.52    8.56    9.2 9.16    8.92    7.44    6   5   3.48    2.92    2.16    2   2   1.2 1   1   1   1.24    1.64    1   1   1.96    2   2   2   1.76    1   1   1   0.52    1.76    3.64    5.12    6   6   6   6   5.52    4.24    2.36    0.88    0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   1   1   1.44    2.44    3.68    5.4 6.88    7   6   6.52    6.76    6.56    5.32    3.6 2.92    3   3.72    3.96    3.8 3   3   3   2.2 2.4 2.28    1.52    1   1   1   1.72    2   1.6 1   1   1   1   1   0.28    0.92    
2   2   2.72    3.64    4   4.84    5   4.08    3   3   2.68    2.36    2   1.16    1   1   2   4.92    4.6 4   4   4   4   4.32    4   1.08    1   1.52    2   2   2   1.68    1   1   1.32    1.48    1   1   1.52    2   2   2   1.68    1   1   1.88    1.48    1   1   1   1   1   1   0.12    0.4 1   1   1.2 3.88    4   5   5   4.6 4   4   3.8 2.08    2   1   1   1.44    2.4 3
NR_047520   LOC643837   ,   3   2.2 0.2 0   0   0.28    1   1   1   1   2.2 4.8 5   5.32    5   5   5   5   3.8 1.2 1   0.4 0   0   0   0   0   1   1   1   1   1   1   1   1.56    1   1   1   1   1   1   1   0.44    0.68    1   1.52    3   3.6 4.96    6.8 9   8.32    8.72    8.48    7   7.4 8.8 7.92    7.12    8.84    8.56    9.4 10.2    10  7.24    6.44    6.76    6.16    5.72    4.96    4.8 5.16    6   5.84    4.12    3   3   2.64    2.56    3.08    3   4.16    5   6.72    7   7.16    7.44    5.76    5   4.56    4   3.68    5   5.4 5.52    6   6   5.28    5   3.6 2   2.08    1.48    1   2   2   2   2   2   1.36    1   1   0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   0   1.16    2   2   2   2   2.88    3   3   1.84    1   2   2   2.04    2.12    2   2   2   2   1   1.28    1.96    1.36    2.76    3   3   3   3   2.72    2   1.64    0.76    1   1.36    2   2   2   2   2   1.48    1   0.64    0   0.08    1   1   1.08    2   2   2   2   2.68    2   2   2.16    3.4 4   4   4.2 4.24    4   5.68    6.52    4.6 4   4   3.8 3.8 4   3.12    2.24    2.6 3   4   4   3.2 3   2.2 2   1.4 1.84    1.24    2   2   2   2   2   2   1.16    0.76    0   0   0   0   0   0   0   0.36    1   1.68    2   2   2.92    5.4 6.76    7.64    7   6.88    7   7.36    7.92    6.24    5.92    7.04    9.52    11.52   12.88   14.8    16.36   19.88   22.24   20  19.36   16.92   15.24   13.84   10.88   8.24    5.08    4.96    3.12    3   2.88    2   2.8 2.96    4   4.44    5   6   6   6   5.12    3.28    2   1.56    1   0.08    1.68    2   2   2.84    3   3   3.8 3.92    2.32    2   2.2 2.16    2   2   1.2 1   1   1   0.8 0   0   0   0.72    2.88    3   3   3   3   3   3   2.28    0.12    0   0.52    1   1   1   1   1.44    2   2   1.48    1   1   1   1.56    1.56    1   1   1   1   1   1   0.44    0.8 1.48    3   3   3   3   3   3.56    3.2 2.76    2   2   2   2   2.68    2.44    2   1.76    1   1.4 2   2   1.56    2   2   2   2   2.04    2   2   1.76    1   1   1   1   0.56    0   0   
0   0   0   0   0   0   0.72    1.52    2   2   2   2   2   2   1.28    0.48                                                                            

1. What is needed

  1. For each row in the data file, find the maximum value from the range of numbers.
  2. Once the maximum has been found for every row, find the average of those maxima.

2. Strategy I was thinking

  1. Separate the non-numerical part from the numerical part and put it into the "keys" of a hash.
  2. Put the numerical part into the "values" of a hash.
  3. Assign the "values" into array @values
  4. Use the List::Util module (use List::Util qw(max)) to find the maximum value of the array
  5. Store these maximum values in another array and find average from this array.

3. Code written so far

use warnings;
use List::Util qw(max);

#Input filename
$file = 'test1.data';

#Open file
open I, '<', $file or die;

#Separate data into keys and values, based on ','
chop (%hash = map { split /\s*,\s*/,$_,2 } grep (!/^$/,<I>));
print "$_ => $hash{$_}\n" for keys %hash; #Code is working fine till here

#Create a values array
@values = values %hash;
foreach $value(@values){
 print "The values are : ", $value,"\n";
}

4. The Problem

Beyond this, I am not able to figure out how to add each "individual" array element into a new array so that I may use the max function.

What I mean is that for example, the first array element in @values contains data like 0 0 1 1 3 4.4. The second array element might have data like 3 2.2 0.28 1 1 4.8. So I need to put each of these array elements into a new array, each element going into a different array so that I may be able to use the max function.

5. Points to Note

  1. Most of the rows contain 400 numbers, some have a little less than that, but never more than 400.

  2. There are a total of 23,558 rows.

  3. File is a .txt file and all the numbers in each row are tab delimited.

I would be grateful to anyone kind enough to point me in the right direction, or perhaps provide better code to tackle the problem described in section 1.

Upvotes: 5

Views: 2065

Answers (2)

dan1111

Reputation: 6566

Here is a fun solution. If you are using List::Util, you might as well use sum also.

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/max sum/;

my %line_max = map {
    /([\w\s]*?)\s*,\s*(.*)/ or die "bad line";
    $1 => max split ' ', $2
} <DATA>;

print "$_: $line_max{$_}\n" foreach (keys %line_max);

my $avg_max = sum (values %line_max) / scalar (values %line_max);
print "average: $avg_max\n";

__DATA__
NR_046018   DDX11L1     ,   0   0   1   1   1   1   1   1   1   1   0   0   0   0   1.44    2.72    3.84    4.92    5.6 6.64    7.08    9.12    9.56    8.28    7.16    6.08    5.4 4.36    3.92    1.88    0   0   0.76    1   1   1   1.2 2   2   2   1.72    2   2   2   1.8 1   1.88    2.4 3   3.36    5   6   6   6.72    6.12    5.6 5.44    5.56    5   4.04    5   4.28    4   4   3.08    2.08    1.68    1.96    1.44    3   3.68    4   4.16    5   4.32    4.8 6.16    6   6.28    6.92    7.84    7   7.32    7.2 5.96    5   4.52    4.08    3   3   4.04    4.12    4.44    4   3.52    3.4 4   4   2.64    1.88    1   1   1   0.64    1   1   1.24    2   2.92    3   3   2.96    2   2   2.56    2   1.08    2.12    3   3   3   3   2.6 3   4.64    3.88    3.72    4   4   4.96    4.6 4   2.36    2   1.28    1   1   0.04    0   0.24    1.08    2.68    3.84    4.12    5.72    6   6   5.76    4.92    3.32    3.12    2.88    2.08    2   2   2   2   2   1.44    2.92    3.04    4.28    5.8 7.8 9.48    10.52   13.04   12.08   11.6    11.72   11  9.2 7.52    7.12    7.08    7.08    8.32    7   6.6 7.6 8.04    8.36    6.72    7.88    7.72    8.4 9.24    8.88    8.96    9.88    10.08   9.24    9.28    10.16   11.04   10.52   10  8.56    8   7.8 7.72    6.44    4.32    4   4   3.72    3.68    3.68    3.28    5.56    7.36    9.48    10  10.52   11  12.16   11.96   9.44    8.64    7.52    7   6.48    6   5   5.12    6.28    6   5.52    6   6.68    6.08    7.52    8.16    7.72    8.52    8.56    9.2 9.16    8.92    7.44    6   5   3.48    2.92    2.16    2   2   1.2 1   1   1   1.24    1.64    1   1   1.96    2   2   2   1.76    1   1   1   0.52    1.76    3.64    5.12    6   6   6   6   5.52    4.24    2.36    0.88    0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   1   1   1.44    2.44    3.68    5.4 6.88    7   6   6.52    6.76    6.56    5.32    3.6 2.92    3   3.72    3.96    3.8 3   3   3   2.2 2.4 2.28    1.52    1   1   1   1.72    2   1.6 1   1   1   1   1   0.28    0.92    
2   2   2.72    3.64    4   4.84    5   4.08    3   3   2.68    2.36    2   1.16    1   1   2   4.92    4.6 4   4   4   4   4.32    4   1.08    1   1.52    2   2   2   1.68    1   1   1.32    1.48    1   1   1.52    2   2   2   1.68    1   1   1.88    1.48    1   1   1   1   1   1   0.12    0.4 1   1   1.2 3.88    4   5   5   4.6 4   4   3.8 2.08    2   1   1   1.44    2.4 3
NR_047520   LOC643837   ,   3   2.2 0.2 0   0   0.28    1   1   1   1   2.2 4.8 5   5.32    5   5   5   5   3.8 1.2 1   0.4 0   0   0   0   0   1   1   1   1   1   1   1   1.56    1   1   1   1   1   1   1   0.44    0.68    1   1.52    3   3.6 4.96    6.8 9   8.32    8.72    8.48    7   7.4 8.8 7.92    7.12    8.84    8.56    9.4 10.2    10  7.24    6.44    6.76    6.16    5.72    4.96    4.8 5.16    6   5.84    4.12    3   3   2.64    2.56    3.08    3   4.16    5   6.72    7   7.16    7.44    5.76    5   4.56    4   3.68    5   5.4 5.52    6   6   5.28    5   3.6 2   2.08    1.48    1   2   2   2   2   2   1.36    1   1   0   0   0.68    1   1   1   1   1   1   1   0.32    0   0   0   1.16    2   2   2   2   2.88    3   3   1.84    1   2   2   2.04    2.12    2   2   2   2   1   1.28    1.96    1.36    2.76    3   3   3   3   2.72    2   1.64    0.76    1   1.36    2   2   2   2   2   1.48    1   0.64    0   0.08    1   1   1.08    2   2   2   2   2.68    2   2   2.16    3.4 4   4   4.2 4.24    4   5.68    6.52    4.6 4   4   3.8 3.8 4   3.12    2.24    2.6 3   4   4   3.2 3   2.2 2   1.4 1.84    1.24    2   2   2   2   2   2   1.16    0.76    0   0   0   0   0   0   0   0.36    1   1.68    2   2   2.92    5.4 6.76    7.64    7   6.88    7   7.36    7.92    6.24    5.92    7.04    9.52    11.52   12.88   14.8    16.36   19.88   22.24   20  19.36   16.92   15.24   13.84   10.88   8.24    5.08    4.96    3.12    3   2.88    2   2.8 2.96    4   4.44    5   6   6   6   5.12    3.28    2   1.56    1   0.08    1.68    2   2   2.84    3   3   3.8 3.92    2.32    2   2.2 2.16    2   2   1.2 1   1   1   0.8 0   0   0   0.72    2.88    3   3   3   3   3   3   2.28    0.12    0   0.52    1   1   1   1   1.44    2   2   1.48    1   1   1   1.56    1.56    1   1   1   1   1   1   0.44    0.8 1.48    3   3   3   3   3   3.56    3.2 2.76    2   2   2   2   2.68    2.44    2   1.76    1   1.4 2   2   1.56    2   2   2   2   2.04    2   2   1.76    1   1   1   1   0.56    0   0   
0   0   0   0   0   0   0.72    1.52    2   2   2   2   2   2   1.28    0.48                                                            

Note: the map syntax is cute, but it reads the whole file into memory at once. If the file is large, use a while loop instead, which processes one line at a time:

my %line_max;

while (<DATA>)
{
    if (/^([\w\s]*?)\s*,\s*(.*)/)
    {
        # Key: the identifiers before the comma; value: the row maximum
        $line_max{$1} = max split ' ', $2;
    }
    else
    {
        print "Line $. is bad.\n";
    }
}
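
As a rough, self-contained sketch of how the while-loop version fits together with the averaging step, the snippet below reads from an in-memory filehandle instead of a real file. The $sample string, its gene names, and its numbers are made up for illustration and are not taken from the dataset; it also assumes each row sits on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/max sum/;

# Hypothetical two-row sample in the same "identifiers , numbers" layout.
my $sample = "NR_1\tGENE_A\t,\t0\t1.5\t3\nNR_2\tGENE_B\t,\t2\t7\t4.25\n";

# Open a filehandle on the string so the sketch needs no input file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my %line_max;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^([\w\s]*?)\s*,\s*(.*)/) {
        # Key: the text before the comma; value: the largest number after it.
        $line_max{$1} = max split ' ', $2;
    }
    else {
        warn "Line $. is bad.\n";
    }
}
close $fh;

# Average of the per-row maxima: sum of the values divided by their count.
my $avg_max = sum(values %line_max) / scalar(keys %line_max);
print "average: $avg_max\n";
```

Only one line is held in memory at a time, and the hash keeps one maximum per row, so memory use grows with the number of rows rather than with the total number of values.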

Upvotes: 1

flesk

Reputation: 7579

If I understand your problem correctly, you're making it overly complicated:

#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max);

#Input filename
my $file = 'test1.data';

#Open file
open my $fh, '<', $file or die "Unable to open $file: $!\n";

my ($total, $num);

while (<$fh>) {
    my @values = split;
    my $max = max(@values[3 .. $#values]);
    $total += $max;
    $num++;
}

my $average_max = $total / $num;
print "Average max: $average_max\n";

Just make one pass over your file: split each line into an array and pass everything from index 3 onward to max (indices 0 to 2 hold the two identifiers and the comma). Add $max to $total for each line, increment a counter ($num), and calculate the average max from those two values at the end.

You should also always enable use strict and use lexical filehandles (open my $fh, ...) rather than bareword ones.
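
One way the one-pass approach above could be hardened is sketched below. It assumes the input may contain blank lines (the question's own code filters them out with grep) and that each row sits on a single line. The average_max subroutine name and the sample data are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: returns the average of the per-line maxima read
# from a filehandle, or undef if no line contained any numbers.
sub average_max {
    my ($fh) = @_;
    my ($total, $num) = (0, 0);

    while (my $line = <$fh>) {
        my @fields = split ' ', $line;

        # Fields 0..2 are the two identifiers and the comma,
        # so skip blank lines and rows that carry no numbers.
        next if @fields < 4;

        $total += max(@fields[3 .. $#fields]);
        $num++;
    }

    # Avoid dividing by zero when the input had no data rows.
    return $num ? $total / $num : undef;
}

# Made-up sample with a blank line in the middle.
my $sample = "NR_1 GENE_A , 0 1.5 3\n\nNR_2 GENE_B , 2 7 4.25\n";
open my $fh, '<', \$sample or die "Unable to open in-memory file: $!\n";

my $average = average_max($fh);
close $fh;

print defined($average) ? "average max: $average\n" : "no data rows found\n";
```

Skipping short lines keeps blank rows from being counted in $num, which would otherwise pull the average down.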

Upvotes: 7
