Eric

Reputation: 81

How to remove any html tags with nothing but optional whitespace between them

I need to clean up some VERY ugly HTML (think <span></span> <em></em> <em> </em> <strong></strong>) over and over again...

I'm looking for a nice and easy preg_replace to eliminate any HTML tag pair that has nothing but optional whitespace between its opening and closing tags. Your assistance is greatly appreciated!

Oh, and just found this beauty:

<p><strong><strong></strong></strong></p>

Looks like this will need to live in a while loop as well, since removing the inner pair leaves the outer pair empty.

Upvotes: 0

Views: 331

Answers (4)

Peter Bailey

Reputation: 105908

It's funny how this topic keeps coming up.

Don't go with regex. Try HTML Tidy instead.

Upvotes: 5

Eric

Reputation: 81

Well, it looks like tidy WAS the answer:

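function cleanupcrap($html) {
    // HTML Tidy options: emit XHTML, return only the body contents, disable line wrapping.
    $tidy_config = array(
        'clean'          => true,
        'output-xhtml'   => true,
        'show-body-only' => true,
        'wrap'           => 0,
    );

    // Parse the fragment as UTF-8, repair it, and return the cleaned markup.
    $tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
    $tidy->cleanRepair();
    return $tidy->value;
}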

Upvotes: 0

Thom Smith

Reputation: 14086

If you really want a regex, here's one:

s:<(\w+)>\s*<\/\1>::g

Run it multiple times to eliminate nested cases.
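Since the question asked for preg_replace, here is a minimal sketch of how the substitution above might look in PHP, looping until nothing more is removed. The function name strip_empty_tags is hypothetical (not from this thread), and the sketch assumes the tags carry no attributes, exactly like the pattern above. It is also tag-agnostic, so legitimately empty elements such as <td></td> would be dropped too.

```php
<?php
// Hypothetical helper: the name and signature are assumptions for illustration.
function strip_empty_tags(string $html): string
{
    // Same idea as the pattern above: an attribute-less opening tag,
    // optional whitespace, then the matching closing tag.
    $pattern = '~<(\w+)>\s*</\1>~';

    do {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // regex engine error: keep the last good value
        }
        $html = $result;
    } while ($count > 0); // removing an inner pair can empty the outer one

    return $html;
}

// The nested example from the question collapses over three passes.
echo strip_empty_tags('<p><strong><strong></strong></strong></p>keep <em>this</em>'), "\n";
// prints: keep <em>this</em>
```

Whitespace that sat between removed pairs is left behind, so a trim() or a whitespace-collapsing pass afterwards may still be needed.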

Upvotes: 0

jheddings

Reputation: 27573

If you are looking to really clean up some code, I'd suggest the Tidy class in PHP. There are some examples that might help get you started. (Note that this is a front end to HTML Tidy.)

Upvotes: 2
