Reputation: 1697

Why does running the same regex twice yield different results?

While trying to make a response to this question, I've encountered some odd behavior from Perl's regex engine. I have a string that contains 2 quantities that I'm trying to match with a regex. The regex just matches any 8 characters before the string "units/ml". I want to grab both quantities.

This script only prints the 2nd one that is matched:

use warnings;
use strict;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ($line =~ m/.{8}units\/ml/g) {
    @array = $line =~ m/.{8}units\/ml/g;
    print join(' ', @array) . "\n";
}

Its output:

 20,000 units/ml

If I run line 6 (the line that assigns to @array) twice:

use warnings;
use strict;
my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ($line =~ m/.{8}units\/ml/g) {
    @array = $line =~ m/.{8}units\/ml/g;
    # Let's run that again, for good measure...
    @array = $line =~ m/.{8}units\/ml/g;
    print join(' ', @array) . "\n";
}

Its output:

100,000 units/ml  20,000 units/ml

Why do these two scripts yield different results?

Upvotes: 2

Answers (4)

user3408541

Reputation: 81

If you want to validate your input a little, you could use something like this:

#!/usr/bin/perl -w

my $line = 'some data 100,000 units/ml data 20,000 units/ml data 4 units/ml data 20,554,323,765,323 units/ml data 1,,2,,3,, units/ml data';

my $i=0;
while( $line =~ m/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/g ) {
  # Test length rather than truth, so that a quantity of "0" is still reported.
  print "Match: \"$1\" at Position:" . pos($line) . " Iteration: $i\n" if length $1;
  $i++;
}

Output looks like this

$perl numbers.with.commas.pl
Match: "100,000" at Position:26 Iteration: 0
Match: "20,000" at Position:47 Iteration: 1
Match: "4" at Position:63 Iteration: 2
Match: "20,554,323,765,323" at Position:96 Iteration: 3

The first regular expression

m/.{8}units\/ml/g

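will incorrectly match whitespace and non-digits. It also assumes that the number plus the space after it is exactly 8 characters long, so shorter numbers drag in stray text from before the number and longer numbers are truncated. The second regular expression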

m/([0-9,]+) units\/ml/g

will usually work, but will incorrectly match improperly formed numbers like 1,,2,,3,,

m/(([0-9]{1,3},?)+) units\/ml/g

This one seems to work slightly better than the previous, but will incorrectly match improperly formed numbers like 1,200,3,4,500
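To make the two failure cases above concrete, here is a minimal sketch that runs both simpler patterns against the malformed numbers mentioned. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# The two simpler patterns discussed above.
my $loose  = qr/([0-9,]+) units\/ml/;
my $groups = qr/(([0-9]{1,3},?)+) units\/ml/;

# Hypothetical inputs containing one malformed number each.
my @loose_hit           = 'data 1,,2,,3,, units/ml'     =~ $loose;
my @groups_on_malformed = 'data 1,,2,,3,, units/ml'     =~ $groups;
my @groups_hit          = 'data 1,200,3,4,500 units/ml' =~ $groups;

print "loose accepts:  $loose_hit[0]\n"  if @loose_hit;
print "groups rejects 1,,2,,3,,\n"       if !@groups_on_malformed;
print "groups accepts: $groups_hit[0]\n" if @groups_hit;
```

The second pattern does reject the doubled commas, but it still accepts groups of fewer than three digits after a comma.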

So I searched for the One True™ regex and found one in the question "Regular expression to match numbers with or without commas and decimals in text":

m/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/g #commas optional

It's a bit more complicated, but appears to be working. It matches numbers with or without commas, as long as the usage is consistent: 1,000,000 and 1000000 will match, but 1,000000 will not. It also matches a decimal part if one is present. As far as I can tell, this is the regex you should use to verify that a number is properly formed. If commas are required, the following should work:
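A quick way to check the consistency claim is to run the commas-optional pattern over a few candidates. This is only a sketch: the candidate list, including the decimal example, is made up for illustration.

```perl
use strict;
use warnings;

# The commas-optional pattern from above, including its leading space.
my $number = qr/ ((\d+|\d{1,3}(,\d{3})*)(\.\d+)?) units\/ml/;

my %result;
for my $candidate ('1,000,000', '1000000', '1,000000', '1,234.56') {
    # Hypothetical input line wrapped around each candidate.
    my $text = "data $candidate units/ml";
    $result{$candidate} = $text =~ $number ? "matched $1" : 'no match';
    print "$candidate: $result{$candidate}\n";
}
```

Only 1,000000 is rejected. Note that the leading space in the pattern matters here: without it, the engine could start partway through a number and match just its tail.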

m/ ((?=.)(\d{1,3}(,\d{3})*)?(\.\d+)?) units\/ml/g #commas required

Whenever I need the /g modifier, I prefer a while loop so that matching continues to the end of the string. In scalar context, a /g match finds only the next occurrence and returns true, so a single if test stops after the first match. A while loop re-evaluates its condition each time, so every iteration resumes searching where the previous match ended, until no matches remain.

Good Luck!

Upvotes: 0

Borodin

Reputation: 126762

The problem is here:

if ($line =~ m/.{8}units\/ml/g) { ... }

A global match in scalar context matches only the next occurrence of the pattern and sets a mark (readable with pos) that says where the next global match on that string should begin.

After that, only 20,000 units/ml remains ahead of the mark, so the list-context match inside the block finds just that one occurrence.
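The mark can be inspected with pos. The following is a small sketch using the string from the question; the variable names $found, $mark and @rest are just for illustration.

```perl
use strict;
use warnings;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';

# Scalar context: finds the first occurrence and records where it ended.
my $found = $line =~ m/.{8}units\/ml/g;
my $mark  = pos($line);
print "after the scalar match, pos = $mark\n";

# List context: starts at the recorded position, so only one match is left.
my @rest = $line =~ m/.{8}units\/ml/g;
print "matches found from there: ", scalar(@rest), "\n";

# The list-context match ran off the end of the string, which clears the mark.
my $mark_cleared = !defined pos($line);
print "pos has been reset\n" if $mark_cleared;
```

Because the mark is cleared once the list-context match runs out of matches, running the same assignment a second time starts from the beginning of the string and finds both quantities.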

To collect all runs of digits or commas in the string that are followed by units/ml, you should write something like this:

use strict;
use warnings;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';

my @array = $line =~ m|([0-9,]+)\s*units/ml|g;

print "$_\n" for @array;

Output:

100,000
20,000

Upvotes: 0

Kenosis

Reputation: 6204

An option, in this case, is to evaluate the array assignment in the if statement:

use Modern::Perl;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;
if ( @array = $line =~ m/.{8}units\/ml/g ) {
    print join( ' ', @array ) . "\n";
}

Output:

100,000 units/ml  20,000 units/ml

If nothing matches, the list assignment evaluates to false in the if condition, so appropriate action can be taken in an else branch.
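As a rough sketch of handling the no-match case, the snippet below uses plain strict and warnings instead of Modern::Perl, and a second, made-up input line that contains no quantities. The %count hash is only there to record what happened.

```perl
use strict;
use warnings;

# The second line is a hypothetical input with nothing to match.
my @lines = (
    'some data 100,000 units/ml data 20,000 units/ml data',
    'no quantities here',
);

my %count;
for my $line (@lines) {
    # A list assignment in the condition yields the number of matches.
    if ( my @array = $line =~ m/.{8}units\/ml/g ) {
        $count{$line} = scalar @array;
        print "$count{$line} match(es): @array\n";
    }
    else {
        $count{$line} = 0;
        print "no match in: $line\n";
    }
}
```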

Upvotes: 1

Tanktalus

Reputation: 22294

It's because of the /g modifier in your if. Since the if is evaluating the =~ in scalar context, it only gets the first item matched. Then, inside your if block, the @array assignment continues the search from where it left off. (This is useful for parsing.)

When you run the extra match, you've already finished matching everything in the string, so you start over from the beginning again, in list context, and you get everything then.

If you remove the g flag in your if, then things work as you expect.
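For reference, this is the first script from the question with only that change applied, as a minimal sketch:

```perl
use strict;
use warnings;

my $line = 'some data 100,000 units/ml data 20,000 units/ml data';
my @array;

# No /g in the condition, so no match position is recorded on $line.
if ($line =~ m/.{8}units\/ml/) {
    # The list-context /g match therefore starts at the beginning of the string.
    @array = $line =~ m/.{8}units\/ml/g;
    print join(' ', @array) . "\n";
}
```

This prints both quantities on the first pass, the same line the question's second script produced.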

Upvotes: 4
