daxim

Reputation: 39158

How come this regex is not greedy?

This is a follow-up to "Perl regular expression to match an IP address". I wanted to show how to solve the problem correctly, but ran into unexpected behaviour.

use 5.010;
use strictures;
use Data::Munge qw(list2re);
use Regexp::IPv6 qw($IPv6_re);
use Regexp::Common qw(net);

our $port_re = list2re 0..65535;

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return $sentence =~ /
        (                   # capture either
          (?<= \[ )
            $IPv6_re        #  IPv6 address without brackets
          (?=  \] )
        |                   # or
            $RE{net}{IPv4}  #  IPv4 address
        )
        :                   # colon sep. host from port
        ($port_re)          #   capture port
    /msx;
}

my ($ip, $port);
($ip, $port) = ip_port_from_netloc 'The netloc is 216.108.225.236:60099';
say $ip;
($ip, $port) = ip_port_from_netloc 'The netloc is [fe80::226:5eff:fe1e:dfbe]:60099';
say $ip;

The second match fails. use re 'debugcolor' reveals that :($port_re) already matches :5 within the IPv6 address. This surprises me because I did not switch off greediness with a ?. I expected it to gobble up everything up to the ], only then match against the separating colon and what follows after.

Why does this happen, and what's the remedy?

Upvotes: 2

Views: 400

Answers (2)

ikegami

Reputation: 386206

Greed would only come into play if one of your atoms has a choice in how much it can match (i.e. if you used *, +, ? or {n,m}). This is not a greediness issue.
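To illustrate the point, here is a minimal sketch in core Perl (the string and variable names are made up for the example). Greediness only changes the result when a quantifier is present; a bare atom matches exactly one thing either way.

```perl
use strict;
use warnings;

my $text = 'aaa';

# Greedy quantifier: + takes as many characters as it can.
my ($greedy) = $text =~ /(a+)/;

# Non-greedy quantifier: +? takes as few characters as it can.
my ($lazy) = $text =~ /(a+?)/;

# No quantifier: the atom matches exactly one character,
# so there is nothing to be greedy or non-greedy about.
my ($fixed) = $text =~ /(a)/;

print "$greedy $lazy $fixed\n";   # prints "aaa a a"
```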

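The problem is that the regex will only match an IPv6 address if it's immediately followed by both "]" (the lookahead) and ":" (the literal colon). Because the lookahead consumes nothing, both are tested against the same character, so that can't possibly happen.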
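The conflict can be reproduced without any CPAN modules. The sketch below assumes a deliberately loose stand-in for $IPv6_re (hex digits and colons only); it is not the real pattern, but the lookaround behaviour is the same.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $IPv6_re: hex digits and colons only.
my $addr_re = qr/[0-9a-f:]+/;
my $netloc  = '[fe80::226:5eff:fe1e:dfbe]:60099';

# The lookahead on its own succeeds: the address is followed by "]".
my $followed_by_bracket =
    $netloc =~ / (?<= \[ ) $addr_re (?= \] ) /x ? 1 : 0;

# Lookahead plus a literal colon: both are checked at the same position,
# and one character cannot be "]" and ":" at once, so this never matches.
my $bracket_and_colon =
    $netloc =~ / (?<= \[ ) $addr_re (?= \] ) : /x ? 1 : 0;

print "$followed_by_bracket $bracket_and_colon\n";   # prints "1 0"
```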

You could use two different matches, or you could use something like the following:

my $port_re = list2re 0..65535;
my $IPv4_re = $RE{net}{IPv4};

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return if $sentence !~ /
        (?: \[ ( $IPv6_re ) \]
        |      ( $IPv4_re )
        )
        : ($port_re)
    /msx;

    return ($1 // $2, $3);
}

Maybe this is a bit cleaner?

my $port_re = list2re 0..65535;
my $IPv4_re = $RE{net}{IPv4};

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return if $sentence !~ /
        (?: \[ (?<addr> $IPv6_re ) \]
        |      (?<addr> $IPv4_re )
        )
        : (?<port> $port_re )
    /msx;

    return ( $+{addr}, $+{port} );
}
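A self-contained sketch of the named-capture approach, assuming simplified stand-in patterns instead of $IPv6_re, $IPv4_re and $port_re (the sub name host_from is hypothetical). When two groups share a name, %+ returns the leftmost one that is defined, which is why a single $+{addr} covers both branches.

```perl
use strict;
use warnings;
use 5.010;   # named captures and %+ need Perl 5.10 or later

# Simplified stand-in patterns, not real IPv4/IPv6 validation.
sub host_from {
    my ($text) = @_;
    return unless $text =~ /
        (?: \[ (?<addr> [0-9a-f:]+ ) \]    # bracketed form
        |      (?<addr> [0-9.]+    )       # bare form
        )
        : (?<port> \d+ )
    /x;
    return ( $+{addr}, $+{port} );
}

my @v6 = host_from('[fe80::1]:80');
my @v4 = host_from('10.0.0.1:8080');
say "@v6";   # prints "fe80::1 80"
say "@v4";   # prints "10.0.0.1 8080"
```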

Upvotes: 6

Richard Simões

Reputation: 12802

Zero-width assertions don't consume any input, so the literal right bracket is still there to be matched after the first capture group. This adjustment appears to work:

/
    \[?(                   # capture either
      (?<= \[ )
        $IPv6_re        #  IPv6 address without brackets
      (?=  \] )
    |                   # or
        (?<! \[ )
        $RE{net}{IPv4}  #  IPv4 address
        (?! \] )
    )\]?
    :                   # colon sep. host from port
    ($port_re)          #   capture port
/msx;
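The claim that a lookahead leaves the bracket unconsumed can be checked with pos. This is a minimal sketch in core Perl, assuming a loose stand-in character class in place of $IPv6_re.

```perl
use strict;
use warnings;

my $netloc = '[fe80::1]:80';
my ( $matched, $next );

# Stand-in for $IPv6_re (assumption): hex digits and colons only.
# With /g in scalar context, pos() reports where the match ended.
if ( $netloc =~ / (?<= \[ ) ([0-9a-f:]+) (?= \] ) /gx ) {
    $matched = $1;
    $next    = substr $netloc, pos($netloc), 1;
}

# The match ends before the bracket, so "]" is the next character
# and the rest of the pattern still has to account for it.
print "$matched $next\n";   # prints "fe80::1 ]"
```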

Upvotes: 3
