Reputation: 139
I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How do you do this with a regex using Perl or sed?
Example: replace all "a" with "b", but not when the "a" is inside an HTML tag such as <a href="aaa">.
Upvotes: 0
Views: 185
Reputation: 41838
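Resurrecting this ancient question because it has a simple solution that wasn't mentioned.
With all the usual disclaimers about using regex to parse HTML, here is a simple way to do it: match whole tags first and put them back unchanged, and capture the target character in group 1 only when it appears outside a tag.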
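#!/usr/bin/perl
use strict;
use warnings;

# Either consume a tag (no capture) or capture a bare "a" in group 1.
my $regex    = qr/<[^>]*|(a)/;
my $subject  = 'aig arother <a href="aaa">';
# With /e the replacement is code: swap a captured "a" for "b", keep tags as matched.
(my $replaced = $subject) =~ s/$regex/ defined $1 ? "b" : $& /eg;
print "$replaced\n";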
See this live demo
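The same alternation trick can be wrapped in a small helper so the pattern and replacement are not hard-coded. This is only a sketch: the name replace_outside_tags is made up for illustration, and it assumes tags never contain a literal ">" inside an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): replace every match of
# $pattern with $replacement, but only in text outside <...> tags.
sub replace_outside_tags {
    my ($text, $pattern, $replacement) = @_;
    # First alternative swallows a tag and captures nothing;
    # second alternative captures the text we actually want to change.
    $text =~ s{<[^>]*|($pattern)}{ defined $1 ? $replacement : $& }ge;
    return $text;
}

print replace_outside_tags('aig arother <a href="aaa">', qr/a/, 'b'), "\n";
# prints: big brother <a href="aaa">
```

Because the tag alternative comes first, the regex engine consumes a tag before it ever gets a chance to try the capture inside it.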
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Upvotes: 0
Reputation: 4795
As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if for whatever reason you do want to use a regex, the following will work:
a(?![^<]*>)
Working example on RegExr and the same for input.
my $var = "salut <a href='a.html'></a> ah ha <a href='about.asp' /> animal";
#           ^     ^       ^         ^  ^   ^  ^       ^     ^       ^   ^
# Replace each "a" that is not followed by ">" before the next "<".
$var =~ s/a(?![^<]*>)/b/g;
print "$var\n";
Output:
sblut <a href='a.html'></a> bh hb <a href='about.asp' /> bnimbl
 ^                          ^   ^                        ^   ^
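To make the negative lookahead easier to read, here is the same pattern written with the /x modifier and commented piece by piece. The variable names and the second input string are assumptions added for illustration; the second string shows one case where this approach is likely to surprise you, namely a literal ">" in plain text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as above, spelled out with the x modifier so each piece can be commented.
my $outside_tag_a = qr/
    a           # the character to replace
    (?!         # ...unless what follows is:
        [^<]*   #   any run of characters that does not open a new tag
        >       #   and then a closing ">", meaning we were inside a tag
    )
/x;

my $html = "salut <a href='a.html'></a> ah ha";
(my $out = $html) =~ s/$outside_tag_a/b/g;
print "$out\n";    # prints: sblut <a href='a.html'></a> bh hb

# Illustrative edge case: a bare ">" in the text looks like the end of a tag,
# so the first "a" is left alone.
my $text = "a > c, a";
(my $edge = $text) =~ s/$outside_tag_a/b/g;
print "$edge\n";   # prints: a > c, b
```

The lookahead only inspects what comes after each "a", so it cannot tell a real tag from stray angle brackets; that is where a proper HTML parser earns its keep.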
Upvotes: 2