Reputation: 1697
While trying to answer this question, I encountered some odd behavior from Perl's regex engine. I have a string containing two quantities that I'm trying to match with a regex. The regex simply matches the 8 characters preceding the string "units/ml". I want to grab both quantities.
This script only prints the 2nd one that is matched:
use warnings;
use strict;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ($line =~ m/.{8}units\/ml/g) {
@array = $line =~ m/.{8}units\/ml/g;
print join(' ', @array) . "\n";
}
Its output:
20,000 units/ml
If I run the line that assigns to @array twice:
use warnings;
use strict;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ($line =~ m/.{8}units\/ml/g) {
@array = $line =~ m/.{8}units\/ml/g;
# Let's run that again, for good measure...
@array = $line =~ m/.{8}units\/ml/g;
print join(' ', @array) . "\n";
}
Its output:
100,000 units/ml 20,000 units/ml
Why do these two scripts yield different results?
Upvotes: 2
Views: 329
Reputation: 81
If you want to validate your input a little, you could use something like this:
#!/usr/bin/perl -w
my $line = 'some data 100,000 units/ml data 20,000 units/ml data 4 units/ml data 20,554,323,765,323 units/ml data 1,,2,,3,, units/ml data';
my $i=0;
while( $line =~ m/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/g ) {
print "Match: \"$1\" at Position:" . pos($line) . " Iteration: $i\n" if length($1); # length() so that a quantity of "0" is still printed
$i++;
}
Output looks like this:
$ perl numbers.with.commas.pl
Match: "100,000" at Position:26 Iteration: 0
Match: "20,000" at Position:47 Iteration: 1
Match: "4" at Position:63 Iteration: 2
Match: "20,554,323,765,323" at Position:96 Iteration: 3
The first regular expression
m/.{8}units\/ml/g
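will also capture whitespace and other non-digit characters. It mishandles any number that is not exactly 7 characters long (8 with the space before "units"): longer numbers are truncated and shorter ones drag in the text that precedes them. The second regular expression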
m/([0-9,]+) units\/ml/g
will usually work, but will incorrectly match on improperly formed numbers like 1,,2,,3,,
m/(([0-9]{1,3},?)+) units\/ml/g
This one seems to work slightly better than the previous, but will incorrectly match improperly formed numbers like 1,200,3,4,500
So I searched for the One True™ regex and found one in the question "Regular expression to match numbers with or without commas and decimals in text":
m/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/g #commas optional
It's a bit more complicated, but it appears to work. It matches numbers with or without commas as long as the usage is consistent: 1,000,000 and 1000000 will match, but 1,000000 will not. It also matches a decimal part if one is present. As far as I can tell, this is the regex you should use to verify that a number is properly formed. If commas are required, the following should work:
m/ ((?=.)(\d{1,3}(,\d{3})*)?(\.\d+)?) units\/ml/g #commas required
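The claims above about which malformed numbers each pattern lets through can be checked with a small script. This is only a sketch: the labels (loose, grouped, consistent), the accepts helper, and the sample list are made up for illustration, and "accepts" is assumed to mean that the pattern matches and captures the whole number.

```perl
use strict;
use warnings;

# Candidate patterns from the discussion, keyed by illustrative labels.
my %pattern = (
    loose      => qr/([0-9,]+) units\/ml/,
    grouped    => qr/(([0-9]{1,3},?)+) units\/ml/,
    consistent => qr/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/,
);

# Sample numbers, well formed and malformed (chosen for illustration).
my @samples = ( '1,,2,,3,,', '1,200,3,4,500', '1,000000', '1,000,000', '1000000' );

# True when the pattern matches and captures exactly the given number.
sub accepts {
    my ( $label, $number ) = @_;
    my $text = "data $number units/ml data";
    return ( $text =~ $pattern{$label} && $1 eq $number ) ? 1 : 0;
}

for my $label (qw(loose grouped consistent)) {
    my @ok = grep { accepts( $label, $_ ) } @samples;
    print "$label accepts: @ok\n";
}
```

Only the last two samples should be accepted by the "consistent" pattern; the other two patterns let at least one malformed number through.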
Whenever I need the /g modifier, I prefer a while loop so that matching continues to the end of the string. A single scalar-context match, as in an if condition, stops at the first match and returns true without going any further. In a while loop the condition is evaluated again on each iteration, and each match resumes searching where the previous one ended, until no matches remain.
Good Luck!
Upvotes: 0
Reputation: 126762
The problem is here:
if ($line =~ m/.{8}units\/ml/g) { ... }
A global match in scalar context matches only the next occurrence of the pattern and sets a mark recording where the next global match should begin. After that, only 20,000 units/ml remains to match the pattern, so the match inside the block finds just one item.
To collect every run of digits and commas that is followed by units/ml, you should write something like this:
use strict;
use warnings;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array = $line =~ m|([0-9,]+)\s*units/ml|g;
print "$_\n" for @array;
Output:
100,000
20,000
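The mark described in this answer can be inspected with pos(). The following is a minimal sketch that reuses the pattern from the question; the variable names ($found, $mark, @rest, $mark_after) are only illustrative.

```perl
use strict;
use warnings;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';

# Scalar-context /g match: finds the first occurrence and records where
# it ended, so the next /g match on $line resumes from that position.
my $found = $line =~ m/.{8}units\/ml/g;
my $mark  = pos($line);

# List-context /g match: returns every match from the mark onward.
my @rest = $line =~ m/.{8}units\/ml/g;

# Once the list-context match has run out of matches, the mark is cleared.
my $mark_after = pos($line);

print "mark after first match: $mark\n";
print "remaining matches: ", scalar(@rest), "\n";
print "mark afterwards: ", ( defined $mark_after ? $mark_after : 'undef' ), "\n";
```

Because the mark is cleared at the end, a further list-context match would start from the beginning of the string again, which is why repeating the assignment in the question returns both quantities.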
Upvotes: 0
Reputation: 6204
One option in this case is to evaluate the array assignment in the if statement itself:
use Modern::Perl;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ( @array = $line =~ m/.{8}units\/ml/g ) {
print join( ' ', @array ) . "\n";
}
Output:
100,000 units/ml 20,000 units/ml
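A list assignment in scalar context evaluates to the number of elements on its right-hand side, so the condition is true only when at least one match was found. Appropriate action can then be taken, if needed, when nothing matched.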
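As a rough sketch of handling the no-match case, the same condition can be paired with an else branch. The find_units helper and its return strings are hypothetical, and core strict/warnings are used here instead of Modern::Perl so the example has no non-core dependency.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: describes the matches found in $text.
sub find_units {
    my ($text) = @_;
    my @found;

    # The list assignment runs the match in list context; used as a
    # condition, it yields the number of matches, so zero means "none".
    if ( @found = $text =~ m/.{8}units\/ml/g ) {
        return 'matched ' . scalar(@found) . ': ' . join( '|', @found );
    }
    else {
        return 'no match';
    }
}

print find_units('some data 100,000 units/ml data 20,000 units/ml data'), "\n";
print find_units('no quantities here'), "\n";
```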
Upvotes: 1
Reputation: 22294
It's because of the /g modifier in your if. Since the if is evaluating the =~ in scalar context, it only gets the first item matched. Then, inside your if block, the @array assignment continues the search from where it left off. (This is useful for parsing.)
When you run the extra match, you've already finished matching everything in the string, so you start over from the beginning again, in list context, and you get everything then.
If you remove the g flag in your if, then things work as you expect.
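For reference, a sketch of the question's first script with only that change applied (the /g flag removed from the if condition) might look like this:

```perl
use strict;
use warnings;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;

# No /g here: a plain scalar match does not leave a resume position on $line.
if ( $line =~ m/.{8}units\/ml/ ) {
    # The list-context /g match therefore starts from the beginning
    # of the string and collects both quantities.
    @array = $line =~ m/.{8}units\/ml/g;
    print join( ' ', @array ), "\n";
}
```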
Upvotes: 4