Reputation: 2637
I am using the following perl regular expression to clean xml/html style formating tags from input.
$expr = qr{
<\s*a(?:\s*|\s+[^>]+)>
((?:
(?> (?:(?!(<\s*a(?:\s*|\s+[^>]+)>|<\/\s*a\s*>)).)+ )
|
(??{ $expr })
)*)
<\/\s*a\s*>
}x;
Applying it recursively it will remove nested <a>...</a>
tags (not that this would make sense if <a>
makes a hyperlink) and keep only the bracketed text:
my $tmp_text = "a<a> e </a>c<a href="test">g <a> d</a> d</a>f";
print $tmp_text."\n";
$tmp_text=~s/$expr/$1/g;
print $tmp_text."\n";
$tmp_text=~s/$expr/$1/g;
print $tmp_text."\n";
This will print
a<a> e </a>c<a href="test">g <a> d</a> d</a>f
a e cg <a> d</a> df
a e cg d df
Now, I would like to do the same with all other formatting tags, like <b>..</b>
and so on. I can surely make a list of all supported tags, replace a
with b
etc. in $expr
, and repeat the substitution with each of them.
However, I wonder if there is a more efficient/compact way by modifying $expr
such that it will do balanced matching for whatever name
is in <name something>...</name>
.
Note that I consciously avoid using perl packages for xml/html parsing or cleaning tools. The input I am processing is not strict html and I do not want to include dependencies.
Upvotes: 0
Views: 109
Reputation: 1578
I believe this meets your stated requirements:
I replaced the 'a' in the regex with a [a-z]+, captured and backreferenced it. That does mean you have to change your line applying it to replace with $2
instead.
If you wanted to make a list of accepted tags ( which still seems better to me, but I do not know your use case ), you could replace the [a-z]+
with, for example, a list of acceptable tags joined by |
.
$expr = qr{
<\s*([a-z]+)(?:\s*|\s+[^>]+)>
((?:
(?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\/\s*\1\s*>)).)+ )
|
(??{ $expr })
)*)
<\/\s*\1\s*>
}x;
A short example script with a tag:
#!/usr/bin/env perl
use strict;
use warnings;
my $expr;
$expr = qr{
<\s*([a-z]+)(?:\s*|\s+[^>]+)>
((?:
(?> (?:(?!(<\s*\1(?:\s*|\s+[^>]+)>|<\/\s*\1\s*>)).)+ )
|
(??{ $expr })
)*)
<\/\s*\1\s*>
}x;
my $tmp_text = 'a<b> e </b>c<b href="test">g <b> d</b> d</b>f';
print $tmp_text."\n";
print $tmp_text."\n" while $tmp_text =~s/$expr/$2/g;
Wiktor has posted a regex in comments which also allows for capital letters and '_' - if that is what you want, just replace [a-z]
with [a-zA-Z_]
as in his example.
Upvotes: 2