cristiangm

Reputation: 23

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two questions about Regexp::Common qw/URI/ and regexes in Perl.

I use Regexp::Common qw/URI/ to find URIs in strings and delete them, but I get an error when a URI is between parentheses.

For example: (http://www.example.com)

The error is caused by the ')': when the code tries to parse the URI, the app crashes. So I've thought of two fixes:

In my code I've tried to implement the regex, but the app freezes. The code that I've tried is this:

use strict;

use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
    $str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;

The output that I want is: (http://www.example.com )

I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;

BTW, I've got a second question about Regexp::Common qw/URI/. I've got two types of string:

  1. ablalbalblalblalbal http://www.example.com
  2. asfasdfasdf http://www.example.com aasdfasdfasdf

I want to remove the URI (and save it) if it is the last component. If it is not, I want to save it without removing it from the text.

Upvotes: 1

Views: 541

Answers (3)

plusplus

Reputation: 2030

Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:

#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;

my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";

The regex from Regexp::Common can be used as part of a longer regex; it doesn't have to be used on its own. I've also used the 'x' modifier on the regex to allow whitespace, so you can see more clearly what is going on: the brackets with backslashes are treated as characters to match, and those without define what is to be captured (presumably like the {-keep} option, which I've not used before).

You could also make the brackets optional, with something like:

/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /x

although that would result in two match variables, one of them undefined, so something like the following would be needed:

my $uri = $1 || $2 || die "Didn't match a URL!";

There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
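One possible "better way" is Perl's branch-reset group, (?| ... ), available since Perl 5.10. It restarts capture numbering in each alternative, so the URL lands in $1 whichever branch matches. The sketch below is only an illustration: first_url is a hypothetical helper, and $url is a simplified stand-in pattern (an assumption, so the example runs without extra modules). With Regexp::Common installed, $RE{URI} could be interpolated in its place.

```perl
use strict;
use warnings;

# Simplified stand-in for a URL pattern (an assumption for this sketch);
# $RE{URI} from Regexp::Common could be interpolated here instead.
my $url = qr{https?://[^\s()]+};

# Hypothetical helper: return the first URL in $text, whether or not it is
# wrapped in parentheses. Returns undef when nothing matches.
sub first_url {
    my ($text) = @_;
    # (?| ... ) resets group numbering per branch, so the URL is always $1.
    my ($uri) = $text =~ / (?| \( ($url) \) | ($url) ) /x;
    return $uri;
}

print first_url("Hello!!, I love (http://www.example.com)"), "\n";
print first_url("see http://www.example.com for details"), "\n";
```

Both calls print http://www.example.com, and there is no need to check $1 and $2 separately.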

To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:

/$RE{URI}\Z/
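Putting the anchor to work on the two example strings from the question might look roughly like this. It is a sketch under assumptions: split_trailing_url is a hypothetical helper, and $url is a simplified stand-in for $RE{URI} so that the example runs without Regexp::Common. It uses \z (absolute end of string) and allows optional trailing whitespace.

```perl
use strict;
use warnings;

# Simplified stand-in for a URL pattern (an assumption for this sketch);
# $RE{URI} from Regexp::Common could be interpolated here instead.
my $url = qr{https?://[^\s()]+};

# Hypothetical helper: returns (text, uri).
# If the URL is the last component, it is removed from the text and saved;
# otherwise the first URL is saved and the text is left untouched.
sub split_trailing_url {
    my ($text) = @_;

    # URL at the very end of the string: strip it (and surrounding spaces).
    if ($text =~ s/\s*($url)\s*\z//) {
        my $uri = $1;
        return ($text, $uri);
    }

    # URL elsewhere (or absent): capture only, no modification.
    my ($uri) = $text =~ /($url)/;
    return ($text, $uri);
}

for my $s ("ablalbalblalblalbal http://www.example.com",
           "asfasdfasdf http://www.example.com aasdfasdfasdf") {
    my ($text, $uri) = split_trailing_url($s);
    print "text=[$text] uri=[$uri]\n";
}
```

For the first string the text becomes "ablalbalblalblalbal" and the URL is returned separately; for the second the text comes back unchanged along with the URL.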

Upvotes: 0

Sinan Ünür

Reputation: 118148

You don't have to test for a match first in order to use the s/// operator correctly: if the string does not match the search pattern, it will not do anything.

#!/usr/bin/perl

use strict; use warnings;

my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;

print "$str\n";

The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.

Upvotes: 2

Dave Cross

Reputation: 69274

my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
  $str =~ s/\)/ \)/;
}

Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.

my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
  $str =~ s/\)/ \)/;
  print $str, "\n";
}

The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis), so the replacement is run again, and this time it prints "Hello!!, I love (GOOGLE  )" with two spaces.

And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.

The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).

my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S\)/){
  $str =~ s/\)/ \)/;
  print $str, "\n";
}

In this case the loop is only executed once.
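The same "preceded by a non-whitespace character" condition can also be written without a loop, by moving it into the substitution as a lookbehind. This is an alternative sketch rather than the approach shown above; pad_closing_parens is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: put one space before every ")" that directly
# follows a non-whitespace character.
sub pad_closing_parens {
    my ($text) = @_;
    # (?<=\S) asserts that the character before ")" is not whitespace,
    # without consuming it, so already-padded parentheses are left alone.
    $text =~ s/(?<=\S)\)/ )/g;
    return $text;
}

my $once  = pad_closing_parens("Hello!!, I love (GOOGLE)");
my $twice = pad_closing_parens($once);

print "$once\n";    # Hello!!, I love (GOOGLE )
print "$twice\n";   # unchanged: the ")" is now preceded by a space
```

Because the /g substitution makes a single pass over the string and the lookbehind rejects parentheses that are already padded, running it repeatedly cannot keep adding spaces.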

Upvotes: 0
