Reputation: 4337
I'm pulling in some Wikipedia markup and want to match the URLs in relative (on-Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical, e.g. /wiki/Eon_(geology), is getting truncated just before the opening paren, so that URL would match as /wiki/Eon_. I've been looking at the code for a while and cannot figure out what I'm doing wrong. Can anyone provide some insight?
Upvotes: 1
Views: 132
Reputation: 5401
There isn't anything wrong with this code as it stands, so long as your Perl is new enough to support these RE features (named captures and the %+ hash require Perl 5.10 or later). Tested with Perl 5.10.1.
my $body = <<"__ENDHTML__";
<a href="/wiki/Eon_(geology)">Body</a> Blah blah
<a href="/wiki/Some_other_(parenthesis)">Body</a>
__ENDHTML__
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Are you using an old Perl?
Upvotes: 1
Reputation: 31461
You didn't anchor the RE at the end of the attribute value. Put a " after the capture group.
While that is a problem, it isn't the problem he was trying to solve. The problem he was trying to solve was that nothing in the RE matches the scheme/hostname (http://en.wiki...). Adding a .*? before the "(?" would help with that.
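To illustrate the closing-quote point above, here is a minimal sketch comparing the pattern with and without a trailing ". The sample markup and the names @loose and @strict are made up for illustration, and a plain numbered capture is used instead of the named one so the matches can be collected in list context. It is only a sketch of the idea, not a full link extractor.

```perl
use strict;
use warnings;

# Hypothetical sample markup; these hrefs are assumptions for illustration.
my $body = <<'__ENDHTML__';
<a href="/wiki/Eon_(geology)">Eon</a>
<a href="/wiki/Special:Random">Random</a>
<a href="/wiki/Perl">Perl</a>
__ENDHTML__

# No closing quote: the colon link still matches, but is cut off at the colon.
my @loose = $body =~ m|<a href="(/wiki/[^:"]+)|gi;

# Closing quote: the character class must reach the end of the attribute,
# so an href containing a colon fails to match at all.
my @strict = $body =~ m|<a href="(/wiki/[^:"]+)"|gi;

print "loose:  @loose\n";
print "strict: @strict\n";
```

With this input, the loose pattern reports /wiki/Special for the special page, while the strict pattern skips that link entirely. Both keep /wiki/Eon_(geology) intact, since parentheses are not excluded by the character class.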
Upvotes: 0