Improving a perl balanced regular expression

Question

I am using the following Perl regular expression to clean XML/HTML-style formatting tags from input.

$expr = qr{
    <\s*a(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*a(?:\s*|\s+[^>]+)>|<\s*/\s*a\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\s*/\s*a\s*>
  }x;

Applied repeatedly, it removes nested <a>...</a> tags (not that nesting would make sense for hyperlinks) and keeps only the enclosed text:

    my $tmp_text = "a e cg  d df";
    print $tmp_text."\n";

    $tmp_text =~ s/$expr/$1/g;
    print $tmp_text."\n";

    $tmp_text =~ s/$expr/$1/g;
    print $tmp_text."\n";

This will print

    a e cg  d df
    a e cg  d df
    a e cg  d df

Now I would like to do the same with all other formatting tags. I could certainly make a list of all supported tags, replace a with b and so on in $expr, and repeat the substitution for each of them.

However, I wonder whether there is a more efficient or compact way: modifying $expr so that it does balanced matching for whatever tag name it encounters.

Note that I consciously avoid using perl packages for xml/html parsing or cleaning tools. The input I am processing is not strict html and I do not want to include dependencies.

bytepusher · Accepted Answer

I believe this meets your stated requirements:

I replaced the 'a' in the regex with [a-z]+, captured it, and backreferenced it as \1. Because the tag name now occupies the first capture group, the substitution has to use $2 instead of $1.

If you wanted to use a list of accepted tags (which still seems better to me, though I do not know your use case), you could replace [a-z]+ with an alternation of the acceptable tag names joined by |.

$expr = qr{
    <\s*([a-z]+)(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\s*/\s*\1\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\s*/\s*\1\s*>
  }x;
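The whitelist variant mentioned above could look roughly like the following. This is only a sketch: the tag list is made up for illustration (the question does not say which tags should be supported), and the closing-tag pattern `<\s*/\s*\1\s*>` is my assumption of how a closing tag should be matched. The `use re 'eval'` line is there because the pattern mixes an interpolated variable with a code block, which older perls refuse without it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use re 'eval';   # pattern interpolates a variable and contains a code block

# Hypothetical whitelist; adjust to the tags you actually want stripped.
my @tags  = qw(a b i em strong);
my $names = join '|', map { quotemeta($_) } @tags;

my $expr;
$expr = qr{
    <\s*($names)(?:\s*|\s+[^>]+)>     # opening tag, name captured in group 1
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\s*/\s*\1\s*>)).)+ )
      |
        (??{ $expr })                 # nested element, matched recursively
    )*)
    <\s*/\s*\1\s*>                    # closing tag with the same name
  }x;

my $text = 'x <b>bold <i>both</i></b> <u>kept</u>';

# Repeat until no whitelisted tag pair is left.
1 while $text =~ s/$expr/$2/g;

print "$text\n";   # prints: x bold both <u>kept</u>
```

Tags that are not in the list, such as the `<u>` pair here, are left untouched, which is the main reason to prefer a whitelist over a generic [a-z]+.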

A short example script with a tag:

#!/usr/bin/env perl

use strict;
use warnings;

my $expr;

$expr = qr{
    <\s*([a-z]+)(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\s*/\s*\1\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\s*/\s*\1\s*>
  }x;


my $tmp_text = 'a e cg  d df';
print $tmp_text."\n";

print $tmp_text."\n" while $tmp_text =~ s/$expr/$2/g;

Wiktor has posted a regex in the comments that also allows capital letters and '_'. If that is what you want, just replace [a-z] with [a-zA-Z_] as in his example.
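For illustration, here is how the script might look with that wider character class. This is a sketch of the character-class swap only, not a copy of Wiktor's regex; the helper name strip_balanced_tags and the sample string are made up, and the closing-tag pattern is again my assumption.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $expr;
$expr = qr{
    <\s*([a-zA-Z_]+)(?:\s*|\s+[^>]+)>     # opening tag, name in group 1
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\s*/\s*\1\s*>)).)+ )
      |
        (??{ $expr })                     # nested element, matched recursively
    )*)
    <\s*/\s*\1\s*>                        # closing tag with the same name
  }x;

# Hypothetical helper: repeat the substitution until nothing is left to strip.
sub strip_balanced_tags {
    my ($text) = @_;
    1 while $text =~ s/$expr/$2/g;
    return $text;
}

my $text = strip_balanced_tags('a <B>e <my_tag>c</my_tag> g</B> d');
print "$text\n";   # prints: a e c g d
```

Note that the backreference \1 compares case-sensitively, so an opening <B> is only closed by </B>, not by </b>. If mixed-case pairs occur in your input, adding the /i modifier to the pattern would be one way to handle that.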
