Tom

Reputation: 139

Replace characters in an HTML document that match a regex, except those inside tags

I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How do you do this with a regex using Perl or sed?

Example: replace all "a" with "b" but not if "a" is in an HTML tag like <a href="aaa">.

Upvotes: 0

Views: 185

Answers (2)

zx81

Reputation: 41838

Resurrecting this ancient question because it had a simple solution that wasn't mentioned.

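With all the usual disclaimers about using regex to parse HTML, here is a simple way to do it. The left side of the alternation matches a tag and captures nothing; the right side captures an "a" in group 1, so only matches where group 1 is set get replaced.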

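#!/usr/bin/perl
use strict;
use warnings;

# Left branch consumes a tag (no capture); right branch captures a bare "a".
my $regex   = qr/<[^>]*|(a)/;
my $subject = 'aig arother <a href="aaa">';
# Replace only when group 1 matched; otherwise put the matched tag text back.
(my $replaced = $subject) =~ s/$regex/defined $1 ? "b" : $&/eg;
print "$replaced\n";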
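
The same alternation trick can be wrapped in a small helper so the pattern and the replacement are not hard-coded. This is only a sketch: the name replace_outside_tags and its arguments are made up for illustration, it captures the tag in group 1 (instead of relying on $&), and it assumes the pattern has no capture groups of its own and that no attribute value contains a literal ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): replace every match of
# $pattern with $replacement, but leave anything that looks like a tag alone.
# Assumes $pattern contains no capture groups of its own.
sub replace_outside_tags {
    my ($text, $pattern, $replacement) = @_;
    # Group 1 swallows a tag; the other branch matches $pattern outside tags.
    $text =~ s/(<[^>]*>?)|$pattern/defined $1 ? $1 : $replacement/eg;
    return $text;
}

print replace_outside_tags('aig arother <a href="aaa">', qr/a/, 'b'), "\n";
```

With the sample string from above this prints big brother <a href="aaa">, the same result as the inline version.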

See this live demo

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...

Upvotes: 0

OGHaza

Reputation: 4795

As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if for whatever reason you do want to use a regex, the following will work:

a(?![^<]*>)

Working example on RegExr and the same for input.

And in Perl:

$var = "salut <a href='a.html'></a> ah ha <a href='about.asp' /> animal";
#        ^     ^       ^         ^  ^   ^  ^       ^     ^       ^   ^
$var =~ s/a(?![^<]*>)/b/g;
print $var;

Output:

sblut <a href='a.html'></a> bh hb <a href='about.asp' /> bnimbl
 ^                          ^   ^                        ^   ^
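
One caveat is worth checking before relying on the lookahead: it only asks whether a ">" appears before the next "<", so a literal ">" in ordinary text also protects the characters in front of it. The snippet below is a small sketch of that behaviour. The helper name swap_outside_tags and the sample strings are made up for illustration; the regex is the one shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the answer.
sub swap_outside_tags {
    my ($html) = @_;
    # Skip any "a" that is followed by ">" with no "<" in between.
    (my $out = $html) =~ s/a(?![^<]*>)/b/g;
    return $out;
}

# Made-up inputs for illustration.
my @samples = (
    "banana <a href='a.html'>data</a>",   # ordinary markup
    "a > b and <b class='a'>a</b>",       # stray ">" in the text
);

print "$_\n  => ", swap_outside_tags($_), "\n" for @samples;
```

The first sample comes out as bbnbnb <a href='a.html'>dbtb</a>. In the second, the leading "a" is left untouched because of the ">" that follows it in the text, giving a > b bnd <b class='a'>b</b>. If the input can contain unescaped ">" outside tags, this is another reason to prefer a parser.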

Upvotes: 2
