Reputation: 25
I'm brand new to Perl (well, to programming in general), and have been given a Perl script (Id_script3.pl).
Code in question from Id_script3.pl:
# main sub
{ # closure
    # keep %species local to sub-routine but only init it once
    my %species;
    sub _init {
        open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
        my $spec;
        while (<$in>) {
            chomp;
            next if /^\s*$/; # skip blank lines
            if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
                # handle letter = lines
                $species{$spec}{$1} = [$2];
                push @{$species{$spec}{$1}}, $3 if $3;
            } else {
                # handle species name lines
                $spec = $_;
                $len = length($spec) if (length($spec) > $len);
            }
        }
        close $in;
    }
    sub analyze {
        my ($masses) = @_;
        _init() unless %species;
        my %data;
        # loop over species entries
        SPEC:
        foreach my $spec (keys %species) {
            # loop over each letter of a species
            LTR:
            foreach my $ltr (keys %{$species{$spec}}) {
                # loop over each mass for a letter
                foreach my $mass (@{$species{$spec}{$ltr}}) {
                    # skip to next letter if it is not found
                    next LTR unless exists($masses->{$mass});
                }
                # if we get here, all mass values were found for the species/letter
                $data{$spec}{cnt}++;
            }
        }
The script needs to be modified so that it reads 'SpeciesId3.txt' instead of the 'SpeciesId.txt' file it currently uses.
There is a slight difference between the two files, so the script needs a small change to work with the new one: SpeciesId3.txt contains no letters (A =, B =, C =), just a (much) longer list of values for each species than the original 'SpeciesId.txt'.
SpeciesId.txt:
African Elephant
B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4
Indian Elephant
B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4
Rabbit
A = 1221.6 AND 1235.6
B = 1453.7
C = 1592.8
D = 2129.1
E = 2808.4
F = 2883.5 AND 2899.5
G = 2957.4 AND 2973.4
SpeciesID3.txt (the file the script needs to be modified for):
African Elephant
826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6
Indian Elephant
835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7
Rabbit
817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7
As you can see, the letters (A =, B = ) have been lost for SpeciesID3.txt.
I've tried a couple of workarounds but have yet to write one that works.
Many Thanks,
Stephen.
Upvotes: 0
Views: 1029
Reputation: 189307
if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
This line contains a regular expression which looks for an uppercase letter [A-Z] followed by an equals sign with optional whitespace on either side \s*=\s*. You basically just want to remove that prefix and simply match a number (\d+(?:\.\d)?).
Because $1, $2, $3 are numbered starting from the leftmost opening parenthesis, the number you want will be in $1 now. (Parentheses with ?: are non-capturing, and don't count.)
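To make the renumbering concrete, here is a small sketch that runs the old and the new pattern against one sample line each and prints the captured groups. The variable names ($old, $new, @old_groups, @new_groups) are only for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Old pattern: the letter is capture group 1, the masses are groups 2 and 3.
my $old = qr{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$};
# New pattern: with the letter prefix gone, the mass is capture group 1.
my $new = qr{^(\d+(?:\.\d)?)$};

# In list context a successful match returns the captured groups in order.
my @old_groups = 'F = 2853.5 AND 2869.5' =~ $old;
my @new_groups = '826.4' =~ $new;

print "old: @old_groups\n";   # old: F 2853.5 2869.5
print "new: @new_groups\n";   # new: 826.4
```

The (?:...) groups take part in the match but never show up in the returned list, which is why the mass moves from $2 to $1 once the letter group is removed.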
You also need to change the variable %species so that its keys are species names and its values simply a list of numbers (the extracted observations).
So:
if (m{^(\d+(?:\.\d)?)$}) {
    # $species{$spec} holds an array reference, so dereference it with @{...}
    push @{$species{$spec}}, $1;
}
The analyze subroutine needs to be similarly adapted (the LTR level is basically gone now).
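As a rough sketch of what the adapted code could look like, the example below parses the new layout into a hash of arrays and drops the letter loop from analyze. Several things here are assumptions rather than part of the original script: the sub name load_species, reading from an in-memory sample instead of SpeciesId3.txt, and above all the idea that analyze should now count how many of a species' masses were observed (the original counted letters whose masses were all found).

```perl
use strict;
use warnings;

# Hypothetical sample in the new layout: a name line, then one mass per line.
my $sample = <<'END';
African Elephant
826.4
836.4

Rabbit
817.5
836.4
END

my %species;    # species name => array ref of masses

sub load_species {
    my ($fh) = @_;
    my $spec;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ m{^(\d+(?:\.\d)?)$}) {
            push @{ $species{$spec} }, $1;     # a mass for the current species
        } else {
            $spec = $line;                     # a species name line
        }
    }
}

sub analyze {
    my ($masses) = @_;                         # hash ref keyed by observed mass
    my %data;
    for my $spec (keys %species) {
        # assumption: count how many of this species' masses were observed
        $data{$spec}{cnt} = grep { exists $masses->{$_} } @{ $species{$spec} };
    }
    return \%data;
}

# Read from an in-memory string so the sketch runs without the real file.
open my $in, '<', \$sample or die "could not open in-memory sample: $!";
load_species($in);
close $in;

my $result = analyze({ '836.4' => 1, '817.5' => 1 });
print "$_: $result->{$_}{cnt}\n" for sort keys %$result;
```

To use it against the real file, open 'SpeciesId3.txt' instead of the in-memory string and pass that filehandle to load_species.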
Upvotes: 0
Reputation: 67900
Well, I don't know if I would consider keeping that script, as it looks rather messy, using script-globals inside subroutines, and strange labels. Here's a method you might like to consider, using Perl's paragraph mode by setting the input record separator $/ to the empty string.
This is a bit clunky since chomp cannot remove newlines from hash keys, so I used a do block to compensate. do { ... } works like a subroutine and returns the value of its last executed statement, in this case the elements of the array.
use strict;
use warnings;
use Data::Dumper;
local $/ = ""; # paragraph mode
my %a = do { my @x = <DATA>; chomp(@x); @x; }; # read the file, remove newlines
$_ = [ split ] for values %a; # split numbers into arrays
print Dumper \%a; # print data structure
__DATA__
African Elephant

826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7

Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7
Output:
$VAR1 = {
'Rabbit' => [
'817.5',
'836.4',
'852.4',
'868.5',
'872.4',
'886.4',
'892.5',
'898.5',
'908.5',
'950.5',
'977.5',
'1029.5',
'1088.6',
'1095.6',
'1105.6',
'1125.5',
'1138.6',
'1161.6',
'1177.6',
'1182.6',
'1201.6',
'1221.6',
'1235.6',
'1267.7',
'1280.6',
'1311.6',
'1332.7',
'1378.5',
'1437.7',
'1453.7',
'1465.7',
'1469.7'
],
'Indian Elephant' => [
'835.4',
'836.4',
'840.5',
'852.4',
'868.4',
'877.4',
'886.4',
'892.5',
'894.5',
'898.5',
'908.5',
'920.5',
'950.5',
'1095.6',
'1105.6',
'1154.6',
'1161.6',
'1180.7',
'1183.6',
'1189.6',
'1196.6',
'1201.6',
'1211.6',
'1230.6',
'1261.6',
'1267.7'
],
'African Elephant' => [
'826.4',
'836.4',
'840.4',
'852.4',
'858.4',
'886.4',
'892.5',
'898.5',
'904.5',
'920.5',
'950.5',
'1001.5',
'1015.5',
'1029.5',
'1095.6',
'1105.6'
]
};
As you can see from this rather verbose output, the result is a hash with animals as keys, and your numbers as values. As long as you can rely on the names and numbers being separated by at least two consecutive newlines, and there are no arbitrary newlines inside the data, this method will do the trick.
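If you'd rather read from a filehandle than from DATA, the same idea can be wrapped in a subroutine so that the change to $/ stays local to it. This is only a sketch: the name read_species, the in-memory sample text, and the sanity check for an odd number of paragraphs are my own additions.

```perl
use strict;
use warnings;

# Hypothetical sample; blank lines separate each name from its numbers.
my $text = "African Elephant\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n";

sub read_species {
    my ($fh) = @_;
    local $/ = "";                       # paragraph mode, restored when the sub returns
    my @paragraphs = <$fh>;
    chomp @paragraphs;                   # in paragraph mode this strips all trailing newlines
    die "expected name/number pairs\n" if @paragraphs % 2;
    my %species = @paragraphs;           # name => block of numbers
    $_ = [ split ] for values %species;  # block of numbers => array ref
    return \%species;
}

# Read from an in-memory string so the sketch runs without a real file.
open my $fh, '<', \$text or die "could not open in-memory sample: $!";
my $species = read_species($fh);
close $fh;

print "$_: @{ $species->{$_} }\n" for sort keys %$species;
```

For a real file, open it by name instead of using the in-memory string and pass that filehandle in. The odd-count check fails early if a blank line is missing and the names and numbers no longer pair up.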
Upvotes: 2