Reputation: 39158
This is a follow-up to "Perl regular expression to match an IP address". I wanted to show how to solve the problem correctly, but ran into unexpected behaviour.
use 5.010;
use strictures;
use Data::Munge qw(list2re);
use Regexp::IPv6 qw($IPv6_re);
use Regexp::Common qw(net);

our $port_re = list2re 0..65535;

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return $sentence =~ /
        (                       # capture either
            (?<= \[ )
            $IPv6_re            # IPv6 address without brackets
            (?= \] )
        |                       # or
            $RE{net}{IPv4}      # IPv4 address
        )
        :                       # colon sep. host from port
        ($port_re)              # capture port
    /msx;
}

my ($ip, $port);
($ip, $port) = ip_port_from_netloc 'The netloc is 216.108.225.236:60099';
say $ip;
($ip, $port) = ip_port_from_netloc 'The netloc is [fe80::226:5eff:fe1e:dfbe]:60099';
say $ip;
The second match fails. use re 'debugcolor' reveals that :($port_re) already matches :5 within the IPv6 address. This surprises me because I did not switch off greediness with a "?". I expected it to gobble up everything up to the "]", only then match against the separating colon and what follows after.
Why does this happen, and what's the remedy?
Upvotes: 2
Views: 400
Reputation: 386206
Greediness would only come into play if one of your atoms had a choice in how much it can match (i.e. if you used *, +, ? or {n,m}). This is not a greediness issue.
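To illustrate the point about greediness, here is a minimal sketch using a made-up input string (not the asker's regexes). It only shows that a greedy or lazy quantifier changes the result, while a fixed-length pattern leaves nothing for greediness to decide.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for this demonstration.
my $text = 'port 60099';

# A quantified atom has a choice, so greediness decides how much it takes.
my ($greedy) = $text =~ /(\d+)/;     # as many digits as possible
my ($lazy)   = $text =~ /(\d+?)/;    # as few digits as possible

# A fixed-length pattern has no choice to make at all.
my ($fixed)  = $text =~ /(\d\d)/;

print "$greedy $lazy $fixed\n";      # prints "60099 6 60"
```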
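The problem is that the regex will only match an IPv6 address if it's immediately followed by both "]" and ":". The lookahead (?= \] ) checks for the "]" without consuming it, so the ":" that comes next in the pattern has to match at that same position. That can't possibly happen.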
You could use two different matches, or you could use something like the following:
my $port_re = list2re 0..65535;
my $IPv4_re = $RE{net}{IPv4};

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return if $sentence !~ /
        (?: \[ ( $IPv6_re ) \]
        |      ( $IPv4_re )
        )
        : ($port_re)
    /msx;
    return ($1 // $2, $3);
}
Maybe this version with named captures is a bit cleaner?
my $port_re = list2re 0..65535;
my $IPv4_re = $RE{net}{IPv4};

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return if $sentence !~ /
        (?: \[ (?<addr> $IPv6_re ) \]
        |      (?<addr> $IPv4_re )
        )
        : (?<port> $port_re )
    /msx;
    return ( $+{addr}, $+{port} );
}
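For anyone who wants to try the named-capture version without installing the CPAN modules, here is a rough sketch. The three patterns $ipv6_re, $ipv4_re and $port_re below are hypothetical, deliberately loose stand-ins, not the real $IPv6_re, $RE{net}{IPv4} or list2re output, so they accept plenty of invalid addresses and ports. The structure of the match is what matters here.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical, deliberately loose stand-ins for the real patterns.
my $ipv6_re = qr/[0-9A-Fa-f:]+/;
my $ipv4_re = qr/\d{1,3}(?:\.\d{1,3}){3}/;
my $port_re = qr/\d{1,5}/;

sub ip_port_from_netloc {
    my ($sentence) = @_;
    return if $sentence !~ /
        (?: \[ (?<addr> $ipv6_re ) \]    # brackets are consumed, not asserted
        |      (?<addr> $ipv4_re )
        )
        : (?<port> $port_re )
    /msx;
    # %+ yields the leftmost defined group of each name.
    return ( $+{addr}, $+{port} );
}

for my $s ( 'The netloc is 216.108.225.236:60099',
            'The netloc is [fe80::226:5eff:fe1e:dfbe]:60099' ) {
    my ($ip, $port) = ip_port_from_netloc($s);
    say "$ip $port";
}
```

With these stand-ins, both sample strings from the question should yield an address and port 60099.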
Upvotes: 6
Reputation: 12802
Zero-width assertions don't consume anything, so the literal right bracket is still there to be matched after the first capture group. This adjustment appears to work:
/
    \[?(                    # capture either
        (?<= \[ )
        $IPv6_re            # IPv6 address without brackets
        (?= \] )
    |                       # or
        (?<! \[ )
        $RE{net}{IPv4}      # IPv4 address
        (?! \] )
    )\]?
    :                       # colon sep. host from port
    ($port_re)              # capture port
/msx;
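The claim that zero-width assertions leave the bracket unconsumed can be checked in isolation. This is only a sketch: $addr_re is a hypothetical, very loose stand-in for $IPv6_re, and the input string is made up.

```perl
use strict;
use warnings;

my $netloc = '[fe80::1]:8080';

# Hypothetical stand-in for $IPv6_re; only good enough for this demo.
my $addr_re = qr/[0-9a-f:]+/;

# The lookahead asserts "]" but leaves it unconsumed,
# so the following ":" would have to match where "]" is.
my $with_lookahead = $netloc =~ / (?<= \[ ) $addr_re (?= \] ) : \d+ /x ? 1 : 0;

# Consuming the brackets lets ":" match right after "]".
my $with_consume = $netloc =~ / \[ $addr_re \] : \d+ /x ? 1 : 0;

print "lookahead: $with_lookahead, consuming: $with_consume\n";
# prints "lookahead: 0, consuming: 1"
```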
Upvotes: 3