Reputation: 81
I need to clean up some VERY ugly HTML (think <span></span> <em></em> <em> </em> <strong></strong>) over and over again...
I'm looking for a nice and easy preg_replace to eliminate any empty HTML tag pairs, i.e. an opening and closing tag with nothing but optional whitespace between them. Your assistance is greatly appreciated!
Oh, and just found this beauty:
<p><strong><strong></strong></strong></p>
Looks like this will need to live in a while loop as well.
Upvotes: 0
Views: 331
Reputation: 105908
It's funny how this topic keeps coming up.
Don't go with regex. Try HTML Tidy instead.
Upvotes: 5
Reputation: 81
Well, it looks like tidy WAS the answer:
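    function cleanupcrap($html) {
        // Tidy options: emit clean XHTML, body content only, no line wrapping
        $tidy_config = array(
            'clean' => true,
            'output-xhtml' => true,
            'show-body-only' => true,
            'wrap' => 0,
        );
        // Parse the markup as UTF-8, then let Tidy repair it
        $tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
        $tidy->cleanRepair();
        // The repaired markup is exposed as the object's value property
        return $tidy->value;
    }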
Upvotes: 0
Reputation: 14086
If you really want a regex, here's one:
s:<(\w+)>\s*<\/\1>::g
Run it multiple times to eliminate nested cases.
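Since the question asked for preg_replace specifically, here is a rough sketch of how the "run it multiple times" part might look in PHP. The function name strip_empty_tags is made up for illustration, and the `[^>]*` part that tolerates attributes is an addition that is not in the pattern above. It assumes no attribute value contains a literal `>`, so treat it as a stopgap for simple markup rather than a substitute for a real parser.

```php
<?php
// Hypothetical helper (name is illustrative, not from any library):
// removes element pairs that contain nothing but optional whitespace.
function strip_empty_tags(string $html): string
{
    // <tag ...attributes...> + optional whitespace + </tag>
    // The [^>]* attribute part is an assumption added on top of the
    // pattern in the answer; it breaks if an attribute value contains '>'.
    $pattern = '~<(\w+)\b[^>]*>\s*</\1\s*>~';

    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // regex engine error: keep the last good string
        }
        $html = $result;
        // Each pass peels off the innermost empty pairs, so repeat
        // until a pass removes nothing.
    } while ($count > 0);

    return $html;
}
```

With the nested example from the question, strip_empty_tags('<p><strong><strong></strong></strong></p>') goes through three removing passes and ends up as an empty string, while markup with real content such as '<p>text</p>' is left alone.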
Upvotes: 0
Reputation: 27573
If you are looking to really clean up some code, I'd suggest the Tidy class in PHP. There are some examples that might help get you started. (Note that this is a front end to HTML Tidy.)
Upvotes: 2