Reputation: 215

Find the maximum and a set of largest numbers (in scientific notation) contained in a huge ascii file

Background:

(1) Here is what I extract from a huge ascii file of around 700Mb:

0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05,
    2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05,
    1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,
    4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,
    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,
    8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

(2) I would like to do two tasks:

(2.1) Find the maximum among the numbers separated by commas and semicolons.

It is 5.0003081213 in the above extracted lines.

(2.2) Find the largest 4 (say) values among the lines.

They are 5.0003081213, 0.000421869, 0.0003385935 and 0.0002973858 in the above extracted lines.

My thought:

(3) I expect to do the work with perl.

(4) I think that I can match the number with ([0-9.e-]+).

My Problem:

(5) However, I am new to Perl and Unix, and I do not know how to proceed to find the maximum values.

(6) I searched for similar questions for half a day and found that I might make use of List::Util. I do not know whether it is an appropriate choice for my problem, and I do not know how its subroutines can be used.

(7) Say the numbers are contained in a file named input.txt. Is it possible to finish the tasks with a one-line script?

Thank you for your understanding; I greatly appreciate your help.

Further Question raised:

Thanks to the many warm replies and help from Stack Overflow users, the above question is solved. However, what if I would like to find the maximum only from line 3 to line 6 of the following data:

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.193129938e-07, 0, 0, 0, 0, 0, 0,
    0, 2.505016514e-05, 4.835713883e-05, 6.128770648e-05, 1.38018881e-05, 2.303402101e-05,
    0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0.000101613512, 5.451410965e-05, 0, 0, 0, 0, 0.001172270099, 7.088900819e-05, 0,
    1.848198352e-06, 0.0006870109246, 0.00276857581, 0.002038545509, 0.001111047938,
    0.0007607533934, 0.0007915864957, 0.001105735631, 0.001456989534, 0.0007245351113,
    0.0004262289031, 0.0003041285247, 0.0001528418892, 2.332078749e-05, 9.695149464e-05,
    1.004024021e-07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

That is,

0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Then, how can I modify the script grep -o '[0-9e.-]*' file | sort -rg | head -1 to achieve this purpose?

I know that sed can operate on a range of lines when given an address such as 3,6p. So I am wondering whether I can modify the above command by adding something like this. I appreciate your help again.

Upvotes: 0

Answers (7)

Sobrique

Reputation: 53498

From a Perl perspective, it is useful to know that $/ is the input record separator. By default it is a newline, but you can set it to anything you like.

Looking at your sample data, I'd therefore say:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw ( max );

$/ = ';';

while (<>) {
    s/;//g;
    my @lines = split("\n");
    s/\s+//g;
    my $block_max = max( split(",") );
    last unless defined $block_max;
    print $block_max, "\n";

    my @top;
    foreach my $line (@lines) {
        $line =~ s/\s+//g;
        my @numbers = split( ",", $line );
        my $max_num = max(@numbers);
        if ( defined $max_num ) { push( @top, $max_num ) }
    }

    print "Top 5:\n";
    print join( "\n", ( sort { $b <=> $a } (@top) )[ 0 .. 4 ] );
}

What we do is:

Iterate over your file in chunks delimited by ;.
Split on \n to get the lines.
Split on , to get the individual values.
Use max on the block and print it.
Use max on each line and push the result onto the array @top.
Print the first 5 elements of @top, sorted in descending order.

Then move on to the next ;-delimited chunk.

To extend this to your original file, you can use a regex to extract the numbers.

E.g.

my @numbers = m/[\de+.-]+/g;

With the /g flag in list context, the match returns every chunk that fits this 'format'. (Of course, if someone includes ee-44 in the file, that'll match too.)
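
The caveat about stray matches can be illustrated with a small shell sketch. It is only an assumption-laden demonstration, not part of this answer: the input line is made up, the loose class is written as [0-9e+.-] because grep -E has no \d, and the stricter pattern (optional sign, digits with an optional decimal point, optional exponent) is one possible alternative.

```shell
# Hypothetical input line containing ordinary words as well as numbers
line='see note, 3.0e-05, 0.5;'

# Loose character class: also picks up the letter "e" from ordinary words
loose=$(printf '%s\n' "$line" | grep -oE '[0-9e+.-]+' | paste -sd' ' -)

# Stricter pattern: optional sign, digits with optional decimal point, optional exponent
strict=$(printf '%s\n' "$line" | grep -oE '[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?' | paste -sd' ' -)

echo "loose:  $loose"
echo "strict: $strict"
```

The loose class reports fragments of "see" and "note" as if they were numbers, while the stricter pattern returns only the two numeric values.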

I would suggest not looking for one-liners. It's a false economy. It is far better to have a script that you can write out, comment and actually understand later than a compact block of text that no one can decipher in 12 months' time.

Upvotes: 0

fedorqui

Reputation: 290025

I would use a combination of grep and sort:

grep -o '[0-9e.-]*' file | sort -rg | head -N

The command grep -o '[0-9e.-]*' (using the character class provided in the question) extracts all the numbers in the file; -o prints each match on its own line.
Then, sort -g sorts taking exponential values into consideration; by using -r we reverse the results, so that the top values show at the top.
Finally, head gets the top N values.

Top value:

$ grep -o '[0-9e.-]*' file | sort -rg | head -1
5.0003081213

Top 4:

$ grep -o '[0-9e.-]*' file | sort -rg | head -4
5.0003081213
0.000421869
0.0003385935
0.0002973858
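
For the follow-up in the question (taking the maximum of lines 3 to 6 only), one possible approach is to put sed -n '3,6p' in front of the same pipeline. The sketch below makes a few assumptions: the sample file is a shortened, hypothetical version of the follow-up data, the pattern is written as -E '[0-9e.-]+' (the same character class, but never matching the empty string), and LC_ALL=C is set so that sort -g reads "." as the decimal point regardless of locale.

```shell
# Hypothetical 7-line sample; line 7 deliberately holds a larger value
sample=$(mktemp)
cat > "$sample" <<'EOF'
0, 0, 1.193129938e-07, 0,
    0, 2.505016514e-05, 4.835713883e-05,
    0, 3.5838803e-05, 0.000104883779,
    0.0007846746908, 0.001728603877,
    0.0004436753321, 0.002182536356,
    0.0004568318755, 3.186642459e-05, 0,
    0, 0.00276857581, 0.002038545509,
EOF

# sed -n '3,6p' prints only lines 3 to 6; the rest of the pipeline is unchanged
max_3_6=$(sed -n '3,6p' "$sample" | grep -oE '[0-9e.-]+' | LC_ALL=C sort -rg | head -n 1)
echo "$max_3_6"
rm -f "$sample"
```

Because sed drops everything outside the range before grep runs, the larger value on line 7 is ignored.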

Upvotes: 7

Dmitry Egorov

Reputation: 9650

UPDATE: The one-liner:

perl -nle 'foreach (split(",|;")) { $_ += 0; @top_n = sort {$b <=> $a} ($_, @top_n); pop @top_n if @top_n > 4; } END { print foreach @top_n; }' input.txt

Nam, the other solutions are just fine and, I believe, have already helped you solve your problem. However, they don't take the huge input into account. Even leu's solution stores the entire array in memory and sorts all those hundreds of megabytes. That said, I totally support leu's idea of not redefining the input record separator and reading line by line instead; this really helps when processing huge files.

There are only about 5 lines of actual code. The rest are comments that explain what's going on behind the scenes and will hopefully help you learn a bit of Perl.

#!/usr/bin/perl -nl

# 0) The -n from above would make the script read the input line by line
# and the -l parameter would automatically strip off any newline chars
# from input and add a newline to every output line

# 1.1) So, the -n parameter made perl read a line from STDIN and place it
# into $_ variable for you. The following code (excluding the END{} block)
# is executed for every input line.
# 1.2) split() takes this $_ string and breaks it into a series of numbers
# (technically still sub-strings), returning the series as an array
# 1.3) Then foreach loops through this array placing each array's item into
# $_ again. (NB. Yes, we're losing the previous $_'s value which was an input
# string but we don't care about it any longer since we've already processed
# it with split().)
foreach (split(",|;")) {

    # 2) Ensure it's stored internally as a number by adding zero to it.
    # This saves us a bit of conversion when sorting values and also
    # makes the final output nicer. Still, you'll get what you want if you
    # comment the following line out.
    $_ += 0;

    # 3.1) Compose a new array by adding the current value ($_) to what
    # we already have (@top_n). The new array is "($_, @top_n)". It's OK
    # if @top_n has nothing in it or even undefined so far, perl will
    # define and initialise it with an empty array when it encounters
    # the @top_n variable first time. (Note: we should better use -w
    # perl command line parameter and define @top_n explicitly beforehand
    # but I'm omitting it here for the sake of simplicity.)
    # 3.2) Then sort the new array. The "$b <=> $a" expression will make
    # it sorted in descending order.
    @top_n = sort {$b <=> $a} ($_, @top_n);

    # 3.3) Finally, throw away the last item (pop does this) if our top-N
    # array has grown beyond the length of interest (4 in this example).
    # This helps keep our script's memory consumption reasonably low.
    # Without doing this we'd end up with several hundred megabytes
    # in memory, all of which would require sorting.
    pop @top_n if @top_n > 4;
}

# 4) This block is only executed once, after all the input file is read and
# processed.
END {
    # 4.1) Here our good old foreach walks the @top_n array, storing
    # the current value in $_ for each iteration.
    # 4.2) Called without arguments, print() outputs the value
    # of the $_ variable. Remember, it also adds a newline to the output
    # - we told it to do so by adding -l in the very first line of the
    # script.
    print foreach @top_n;
}

Usage: perl top_n.pl input.txt, provided top_n.pl is the script name.
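
The same bounded top-N idea (keep only N values in memory while streaming through the input) can also be sketched with tr and awk. This is an illustrative assumption, not part of the answer above: the sample file is a shortened, hypothetical version of the question's data, and the array names num and txt are made up. The original text of each number is kept separately, so that printing does not round it through awk's default number format.

```shell
# Hypothetical shortened sample of the data in the question
sample=$(mktemp)
cat > "$sample" <<'EOF'
0, 0, 3.043678e-05, 0.000421869,
    5.0003081213, 0.0001938675, 0.0002973858, 0.0003385935,
    8.763598e-05, 0, 0;
EOF

# Put one number per line, then keep a sorted list of at most n entries
top4=$(tr ',;' '\n\n' < "$sample" | awk -v n=4 '
NF {                                  # skip the empty records left by tr
    v = $1 + 0                        # numeric value, exponents included
    i = cnt
    # shift smaller entries down to make room (insertion into a sorted list)
    while (i >= 1 && num[i] < v) { num[i+1] = num[i]; txt[i+1] = txt[i]; i-- }
    num[i+1] = v; txt[i+1] = $1       # keep the original text for printing
    if (cnt < n) cnt++                # slot n+1 is scratch and never printed
}
END { for (i = 1; i <= cnt; i++) print txt[i] }')
echo "$top4"
rm -f "$sample"
```

Each input value costs at most n comparisons, and memory use stays constant however large the file is.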

Upvotes: 1

leu

Reputation: 2081

As I understand your question, you want to filter the numbers out of your huge input file. Splitting at delimiters is therefore not sufficient; instead you need to extract the numbers with a regex.

This is my attempt:

use strict;
use warnings;

my(@numbers);
while (my $line = <>) {
    while($line =~ m|([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)|g) {
        push @numbers, $1;
    }
}
@numbers = sort { $b <=> $a } @numbers;

print "largest value:\n  $numbers[0]\n";
print "next four numbers: \n  " . join("\n  ",@numbers[1..4]) . "\n";

It's not a one-liner, but it may be easier to read.

Use it like this: perl findNumbers.pl input.txt where findNumbers.pl is the script as above.

Upvotes: 1

simbabque

Reputation: 54373

This solution is very verbose and assumes you already know how to get the data into the program. There is no need to find the numbers with a regex. You can just split on commas, get a list and sort it numerically.

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'max';

# I'm assuming you already have that data in one line in a variable
my $data = qq{0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05, 2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05, 1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;};

# remove the semicolon
chop $data;

# split to a list on comma and possible whitespace
my @numbers = split /,\s*/, $data;

# this is from List::Util
say 'Max: ' . max(@numbers);

# sort numerical and grab the highest 4
say $_ for ( reverse sort { $a <=> $b } @numbers )[ 0 .. 3 ];

Upvotes: 1

lamchob

Reputation: 80

If you really want to use a one-line script, you can use this to get the largest value:

$/=undef;print "largest: " .(sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0] . "\n";

And this to get the four largest values:

$/=undef;print join ("," , (sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0..3]) . "\n";

Save one of these lines into a file, say sort.pl, and execute cat /path/to/input.txt | perl /path/to/sort.pl

Although it does what it should, it is not the prettiest solution.

Upvotes: 0

hek2mgl

Reputation: 158090

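awk can work with numbers, even in scientific notation. You can use the following script to get the maximum (a multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves that behaviour unspecified):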

awk '{m=(m>$0)?m:$0}END{print m}' RS="[,\n;]" input.file

Upvotes: 1
