highsciguy
highsciguy

Reputation: 2637

Improving a perl balanced regular expression

I am using the following perl regular expression to clean xml/html style formating tags from input.

$expr = qr{
    <\s*a(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*a(?:\s*|\s+[^>]+)>|<\/\s*a\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\/\s*a\s*>
  }x;

Applying it recursively it will remove nested <a>...</a> tags (not that this would make sense if <a> makes a hyperlink) and keep only the bracketed text:

    my $tmp_text = "a<a> e </a>c<a href="test">g <a> d</a> d</a>f";
    print $tmp_text."\n";

    $tmp_text=~s/$expr/$1/g;
    print $tmp_text."\n";

    $tmp_text=~s/$expr/$1/g;
    print $tmp_text."\n";

This will print

    a<a> e </a>c<a href="test">g <a> d</a> d</a>f
    a e cg <a> d</a> df
    a e cg  d df

Now, I would like to do the same with all other formatting tags, like <b>..</b> and so on. I can surely make a list of all supported tags, replace a with b etc. in $expr, and repeat the substitution with each of them.

However, I wonder if there is a more efficient/compact way by modifying $expr such that it will do balanced matching for whatever name is in <name something>...</name>.

Note that I consciously avoid using perl packages for xml/html parsing or cleaning tools. The input I am processing is not strict html and I do not want to include dependencies.

Upvotes: 0

Views: 109

Answers (1)

bytepusher
bytepusher

Reputation: 1578

I believe this meets your stated requirements:

I replaced the 'a' in the regex with a [a-z]+, captured and backreferenced it. That does mean you have to change your line applying it to replace with $2 instead.

If you wanted to make a list of accepted tags ( which still seems better to me, but I do not know your use case ), you could replace the [a-z]+ with, for example, a list of acceptable tags joined by |.

$expr = qr{
    <\s*([a-z]+)(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\/\s*\1\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\/\s*\1\s*>
  }x;

A short example script with a tag:

#!/usr/bin/env perl

use strict;
use warnings;

my $expr;

$expr = qr{
    <\s*([a-z]+)(?:\s*|\s+[^>]+)>
    ((?:
        (?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\/\s*\1\s*>)).)+ )
      |
        (??{ $expr })
    )*)
    <\/\s*\1\s*>
  }x;


my $tmp_text = 'a<b> e </b>c<b href="test">g <b> d</b> d</b>f';
print $tmp_text."\n";

print $tmp_text."\n" while $tmp_text =~s/$expr/$2/g;

Wiktor has posted a regex in comments which also allows for capital letters and '_' - if that is what you want, just replace [a-z] with [a-zA-Z_] as in his example.

Upvotes: 2

Related Questions