icodestuff

Reputation: 350

How can I remove a table from an HTML document?

I'm upgrading a set of web pages to a new system, and I want to strip out the boilerplate at the top of each page and replace it with new boilerplate. Fortunately, each page has a content table, and no tables before it. I want to do something like:

$contents =~ s/^.*<table/$newHeader/;

This only works for the first line of $contents. Is there a way to replace everything before (and including) the first <table in the file with my new boilerplate?

Upvotes: 1

Views: 408

Answers (2)

Clinton Pierce

Reputation: 13209

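The "." normally matches any character except a newline. Append the "s" modifier to your regexp so that "." also matches newlines, and use the non-greedy ".*?" so the match stops at the first "<table" rather than the last: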

 $contents =~ s/^.*?<table/$newHeader/s;
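
To see the difference, here is a small sketch that runs the three variants side by side. The sample page and the "NEW" placeholder are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample page: some boilerplate, then two tables.
my $page = "<html>\n<body>\nold header\n"
         . "<table id=\"a\">\n</table>\n"
         . "<table id=\"b\">\n</table>\n";
my $newHeader = "NEW";    # placeholder replacement text

# Without /s, "." stops at the first newline, so "<table" is never reached
# and the substitution does nothing.
(my $no_s = $page) =~ s/^.*?<table/$newHeader/;

# With /s and the non-greedy ".*?", the match ends at the first "<table".
(my $lazy = $page) =~ s/^.*?<table/$newHeader/s;

# With /s and the greedy ".*", the match runs on to the last "<table".
(my $greedy = $page) =~ s/^.*<table/$newHeader/s;

print "no /s: page unchanged\n" if $no_s eq $page;
print "non-greedy:\n$lazy";
print "greedy:\n$greedy";
```

The non-greedy version keeps both tables (minus the first "<table" text), while the greedy one swallows the whole first table.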

Upvotes: 3

Adam Batkin

Reputation: 53034

You could use Perl's "/s" option, which tells it that "." matches all characters including newlines (the string is treated as a single giant line instead of line by line). You limit the match to the first table by putting a "?" after the "*" to make it non-greedy:

$contents =~ s/^.*?<table/$newHeader/s;

Also, remember that the replacement will also strip out the text "<table", so you will need to put it back after the new header, possibly with:

$contents =~ s/^.*?<table/${newHeader}<table/s;

Or you can use a zero-width positive look-ahead assertion, which says "following the match, this expression must also match" but the text in the lookahead assertion is not considered part of the match (and therefore won't be replaced):

$contents =~ s/^.*?(?=<table)/$newHeader/s;

And that will leave the "<table" intact.
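
Putting it together, a minimal sketch might look like the following. The helper name replace_header and the sample markup are hypothetical, and it assumes the whole page is already in one string:

```perl
use strict;
use warnings;

# replace_header is a hypothetical helper name, not part of any library.
sub replace_header {
    my ($contents, $newHeader) = @_;
    # /s lets "." cross newlines; ".*?" stops at the first "<table";
    # the lookahead (?=<table) requires "<table" but does not consume it.
    $contents =~ s/^.*?(?=<table)/$newHeader/s;
    return $contents;
}

# Made-up page: old boilerplate followed by the content table.
my $old = "<html>\n<body>\n<h1>Old site</h1>\n"
        . "<table>\n<tr><td>content</td></tr>\n</table>\n";

my $new = replace_header($old, "<html>\n<body>\n<h1>New site</h1>\n");
print $new;
```

If a page contains no "<table" at all, the pattern fails to match and the string comes back unchanged, so it may be worth checking the return value of the substitution before writing the file out.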

Upvotes: 7
