theraccoonbear

Reputation: 4337

Odd Perl Regex Behavior with Parens

I'm pulling in some Wikipedia markup and want to match the URLs in relative (on-Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:

while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) { 
  my $url = $+{url};
  print "$url\n";
}

Unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical, e.g. /wiki/Eon_(geology), is truncated just before the opening paren, so that URL matches as /wiki/Eon_. I've been looking at the code for a while and cannot figure out what I'm doing wrong. Can anyone provide some insight?

Upvotes: 1

Views: 132

Answers (2)

Stuart Watt

Reputation: 5401

There isn't anything wrong with this code as it stands, as long as your Perl is new enough to support these regex features (named captures and %+ need 5.10 or later). Tested with Perl 5.10.1.

use strict;
use warnings;

my $body = <<"__ENDHTML__";
<a href="/wiki/Eon_(geology)">Body</a> Blah blah 
<a href="/wiki/Some_other_(parenthesis)">Body</a>
__ENDHTML__

while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) { 
  my $url = $+{url};
  print "$url\n";
}

Are you using an old Perl?
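One way to check, as a minimal sketch: print the interpreter version, and fall back to a plain numbered capture, which also works on Perls older than 5.10 where (?<url>...) and %+ are unavailable. The helper name wiki_links and the sample $body are assumptions for illustration, modeled on the test data above.

```perl
use strict;
use warnings;

# Print the running interpreter's version (for example 5.10.1).
printf "Perl version: %vd\n", $^V;

# Hypothetical helper: the same pattern as in the question, but with a
# numbered group ($1) instead of a named capture, so it does not
# depend on Perl 5.10 features.
sub wiki_links {
    my ($body) = @_;
    my @urls;
    while ($body =~ m|<a href="(/wiki/[^:"]+)|gis) {
        push @urls, $1;
    }
    return @urls;
}

# Sample input (an assumption, not real Wikipedia markup).
my $body = <<'__ENDHTML__';
<a href="/wiki/Eon_(geology)">Body</a> Blah blah
<a href="/wiki/Some_other_(parenthesis)">Body</a>
__ENDHTML__

print "$_\n" for wiki_links($body);
```

If this prints the full URLs, parentheses included, the pattern itself is fine and the truncation is more likely coming from the input or from the Perl version.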

Upvotes: 1

Seth Robertson

Reputation: 31461

You didn't anchor the RE at the end of the quoted string. Put a closing " after the capture group.

While that is a problem, it isn't the problem the asker was trying to solve. That problem is that nothing in the RE matches the method/hostname (http://en.wiki...). Adding a .*? before the "(?" would help with that.
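Both suggestions can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's exact code: the helper name relative_wiki_links and the sample $body are made up, and [^"]*? is used in place of the suggested .*? so that the optional prefix cannot run past the closing quote. Named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper combining the closing-quote anchor with an
# optional method/hostname prefix.
sub relative_wiki_links {
    my ($body) = @_;
    my @urls;
    # [^"]*? lazily skips an optional prefix such as http://en.wikipedia.org
    # (an assumption standing in for .*?). The trailing " forces the
    # capture to extend to the end of the attribute value, so a URL
    # with a colon in its path is rejected instead of being cut short.
    while ($body =~ m|<a href="[^"]*?(?<url>/wiki/[^:"]+)"|gis) {
        push @urls, $+{url};
    }
    return @urls;
}

# Sample input (an assumption for illustration).
my $body = <<'__ENDHTML__';
<a href="/wiki/Eon_(geology)">Eon</a>
<a href="/wiki/Special:Random">Random</a>
<a href="http://en.wikipedia.org/wiki/Era_(geology)">Era</a>
__ENDHTML__

print "$_\n" for relative_wiki_links($body);
```

With the closing quote in the pattern, the Special:Random link is skipped entirely. Without it, the original pattern would report that link as /wiki/Special, because [^:"]+ simply stops at the colon and the match still succeeds.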

Upvotes: 0
