monototo

Reputation: 3

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:

<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">

into

<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">

The regex for the closing parenthesis works fine

perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html

giving me

    <a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">

The problem arises with the equivalent regex for the opening parenthesis:

perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html

just returns the two groups with nothing in between:

<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">

Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g), the parenthesis still disappears (giving me %20##2%29).

If, however, in a fit of desperation, I put seven parentheses into the substitution, it works.

perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html

outputs

<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">

Can somebody please make sense of this?

Upvotes: 0

Views: 341

Answers (3)

Borodin

Reputation: 126742

The pattern you have doesn't match the string you show at all. It matches something that looks like

<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">

with literal dots, and whatever $i contains.

Also, a couple of points about your substitution:

  • Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.

  • Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.

So your regex could be written

s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;

but it still doesn't "work fine" with the string you gave, which would have to look something like

<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">

again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.

However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.

There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like

s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;

which works fine on the string you gave, and replaces both open and close parentheses at once.

Upvotes: 0

Kenosis

Reputation: 6204

Perhaps the following will be helpful or at least provide some direction. It works on Perl 5.10 and above.

use strict;
use warnings;
use v5.10.0; # For regex \K

use URI::Escape;

my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;

Output:

<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">

The pattern uses the year and the encoded space (2013%20) as an anchor, and \K keeps everything matched up to that point out of the text being replaced. It then captures the URI-encoded text before .pdf, which the /e modifier decodes with uri_unescape and uses as the substitution text.

Upvotes: 3

perreal

Reputation: 98028

I had some problems understanding your regex, but this might work:

 perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input

Upvotes: 0
