Stephen McGowan

Reputation: 25

Modifying a perl script

I'm fairly new to Perl (and to programming in general), and have been given a Perl script (Id_script3.pl).

Code in question from Id_script3.pl:

# main sub 
{ # closure 
# keep %species local to sub-routine but only init it once 
my %species; 
sub _init { 
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!"; 
    my $spec; 
    while (<$in>) { 
        chomp; 
        next if /^\s*$/; # skip blank lines 
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) { 
            # handle letter = lines 
            $species{$spec}{$1} = [$2]; 
            push @{$species{$spec}{$1}}, $3 if $3; 
        } else { 
            # handle species name lines 
            $spec = $_; 
            $len = length($spec) if (length($spec) > $len); 
        } 
    } 
    close $in; 
} 
sub analyze { 
    my ($masses) = @_; 
    _init() unless %species; 
    my %data; 
    # loop over species entries 
SPEC: 
    foreach my $spec (keys %species) { 
        # loop over each letter of a species 
LTR: 
        foreach my $ltr (keys %{$species{$spec}}) { 
            # loop over each mass for a letter 
            foreach my $mass (@{$species{$spec}{$ltr}}) { 
                # skip to next letter if it is not found 
                next LTR unless exists($masses->{$mass}); 
            } 
            # if we get here, all mass values were found for the species/letter 
            $data{$spec}{cnt}++; 
        } 
    }

The script requires a modification, in which 'SpeciesId3.txt' will be used instead of the 'SpeciesId.txt' which is currently used by the script.

The two files differ slightly, so the script needs a small modification to work: SpeciesId3.txt contains no letters (A =, B =, C =), just a (much) longer list of values per species than the original 'SpeciesId.txt'.

SpeciesId.txt:

African Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4

Indian Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4

Rabbit

A = 1221.6 AND 1235.6
B = 1453.7
C = 1592.8
D = 2129.1
E = 2808.4
F = 2883.5 AND 2899.5
G = 2957.4 AND 2973.4

SpeciesId3.txt (the file the script needs to be modified for):

African Elephant


826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7


Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7

As you can see, the letters (A =, B =) are gone in SpeciesId3.txt.

I've tried a couple of work-arounds, but have yet to write one that works.

Many Thanks,

Stephen.

Upvotes: 0

Views: 1029

Answers (2)

tripleee

Reputation: 189307

if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {

This line contains a regular expression which looks for an uppercase letter [A-Z] followed by an equals sign with optional whitespace on either side \s*=\s*. You basically just want to remove that prefix and simply match a number (\d+(?:\.\d)?).

Because $1, $2, $3 are numbered starting from the leftmost opening parenthesis, the number you want will be in $1 now. (Parentheses with ?: are non-capturing, and don't count.)
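
To make the renumbering concrete, here is a small sketch. The two sample lines come from the files in the question; the variable names ($old, $new, @old_caps, @new_caps) are only for illustration. A match in list context returns the captures in order, so you can see that the (?:...) groups never get a number of their own.

```perl
use strict;
use warnings;

# Original pattern: letter in $1, first mass in $2, optional second mass in $3.
my $old = qr{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$};

# Simplified pattern for the new file: the mass is the only capture, so it is $1.
my $new = qr{^(\d+(?:\.\d)?)$};

# A match in list context returns the captured groups ($1, $2, ...).
my @old_caps = 'F = 2853.5 AND 2869.5' =~ $old;
my @new_caps = '826.4' =~ $new;

print "old: @old_caps\n";   # letter, first mass, second mass
print "new: @new_caps\n";   # just the mass
```

A species name such as "African Elephant" does not match the simplified pattern, so it still falls through to the else branch of the original loop.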

You also need to change the variable %species so that its keys are species names and its values simply a list of numbers (the extracted observations).

So:

if (m{^(\d+(?:\.\d)?)$}) { 
    push @{$species{$spec}}, $1; 
}

The analyze subroutine needs to be similarly adapted (the LTR level is basically gone now).
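
As a rough sketch of what that adaptation could look like: the posted script is cut off after the counting loop, so what "cnt" should mean without letters is my assumption. Here it simply counts how many of a species' masses were found. The names load_species and the sample data are hypothetical, and I pass the data around as references instead of using the closure, purely to keep the example self-contained.

```perl
use strict;
use warnings;

# Read "species name / list of masses" data from an open filehandle.
# Returns a hash ref: species name => array ref of masses.
sub load_species {
    my ($in) = @_;
    my (%species, $spec);
    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^\s*$/;                  # skip blank lines
        if ($line =~ m{^(\d+(?:\.\d)?)$}) {
            # mass line: append to the current species (ignored if no name seen yet)
            push @{ $species{$spec} }, $1 if defined $spec;
        } else {
            $spec = $line;                         # species name line
        }
    }
    return \%species;
}

# Assumption: without letters, "cnt" is the number of a species' masses
# that appear as keys in the %$masses hash.
sub analyze {
    my ($species, $masses) = @_;
    my %data;
    foreach my $spec (keys %$species) {
        # grep in scalar context returns the number of matching elements
        $data{$spec}{cnt} = grep { exists $masses->{$_} } @{ $species->{$spec} };
    }
    return \%data;
}

# Hypothetical sample input standing in for SpeciesId3.txt.
my $sample = "African Elephant\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n852.4\n";
open my $in, '<', \$sample or die "could not open in-memory file: $!";
my $species = load_species($in);
close $in;

my $result = analyze($species, { '836.4' => 1, '852.4' => 1 });
printf "%s: %d\n", $_, $result->{$_}{cnt} for sort keys %$result;
```

To use the real file, open 'SpeciesId3.txt' instead of the in-memory string. If the rest of your script expects a different meaning for cnt (for example, "all masses found"), the grep line is the only part that would need to change.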

Upvotes: 0

TLP

Reputation: 67900

Well, I don't know if I would consider keeping that script, as it looks rather messy, using script-globals inside subroutines, and strange labels. Here's a method you might like to consider, using Perl's paragraph mode by setting the input record separator $/ to the empty string.

This is a bit clunky, since chomp cannot remove newlines from hash keys once they are in the hash, so I used a do block to chomp the list before it is assigned. do { ... } evaluates its block and returns the value of the last executed statement, in this case the elements of the array.

use strict;
use warnings;
use Data::Dumper;

local $/ = "";        # paragraph mode

my %a = do { my @x = <DATA>; chomp(@x); @x; };  # read the file, remove newlines
$_ = [ split ] for values %a;                   # split numbers into arrays
print Dumper \%a;                               # print data structure

__DATA__
African Elephant


826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7


Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7

Output:

$VAR1 = {
          'Rabbit' => [
                        '817.5',
                        '836.4',
                        '852.4',
                        '868.5',
                        '872.4',
                        '886.4',
                        '892.5',
                        '898.5',
                        '908.5',
                        '950.5',
                        '977.5',
                        '1029.5',
                        '1088.6',
                        '1095.6',
                        '1105.6',
                        '1125.5',
                        '1138.6',
                        '1161.6',
                        '1177.6',
                        '1182.6',
                        '1201.6',
                        '1221.6',
                        '1235.6',
                        '1267.7',
                        '1280.6',
                        '1311.6',
                        '1332.7',
                        '1378.5',
                        '1437.7',
                        '1453.7',
                        '1465.7',
                        '1469.7'
                      ],
          'Indian Elephant' => [
                                 '835.4',
                                 '836.4',
                                 '840.5',
                                 '852.4',
                                 '868.4',
                                 '877.4',
                                 '886.4',
                                 '892.5',
                                 '894.5',
                                 '898.5',
                                 '908.5',
                                 '920.5',
                                 '950.5',
                                 '1095.6',
                                 '1105.6',
                                 '1154.6',
                                 '1161.6',
                                 '1180.7',
                                 '1183.6',
                                 '1189.6',
                                 '1196.6',
                                 '1201.6',
                                 '1211.6',
                                 '1230.6',
                                 '1261.6',
                                 '1267.7'
                               ],
          'African Elephant' => [
                                  '826.4',
                                  '836.4',
                                  '840.4',
                                  '852.4',
                                  '858.4',
                                  '886.4',
                                  '892.5',
                                  '898.5',
                                  '904.5',
                                  '920.5',
                                  '950.5',
                                  '1001.5',
                                  '1015.5',
                                  '1029.5',
                                  '1095.6',
                                  '1105.6'
                                ]
        };

As you can see from this rather verbose output, the result is a hash with the animal names as keys and array references holding your numbers as values. As long as you can rely on the names and numbers being separated by at least two consecutive newlines, and there are no stray blank lines inside the lists of numbers, this method will do the trick.
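
If you want to read from a file rather than from DATA, the same idea could be wrapped in a small subroutine. This is only a sketch: read_paragraphs is a made-up name, the short sample string stands in for your real file, and the odd-count check is an extra safety net I added for the case where a stray blank line splits a list of numbers.

```perl
use strict;
use warnings;

# Read paragraph-separated "name, numbers, name, numbers, ..." data
# from an open filehandle into a hash of array refs.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";                 # paragraph mode, restored when the sub returns
    my @chunks = <$fh>;
    chomp @chunks;                 # in paragraph mode this strips all trailing newlines
    die "expected name/number pairs, got an odd number of paragraphs\n"
        if @chunks % 2;
    my %species = @chunks;         # name => block of numbers
    $_ = [ split ] for values %species;   # split each block into an array ref
    return \%species;
}

# Hypothetical sample input; with a real file you would use
#   open my $fh, '<', 'SpeciesId3.txt' or die "could not open SpeciesId3.txt: $!";
my $text = "African Elephant\n\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n852.4\n";
open my $fh, '<', \$text or die "could not open in-memory file: $!";
my $species = read_paragraphs($fh);
close $fh;

print "$_: @{ $species->{$_} }\n" for sort keys %$species;
```

Setting $/ with local inside the subroutine keeps paragraph mode from leaking into the rest of the script, which matters if other parts of it read files line by line.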

Upvotes: 2
