Ivijan Stefan Stipić

Reputation: 6668

Extracting the body HTML and cleaning comments using PHP and regex

I want to strip comments and some other garbage or tags from the <body> section of an HTML document using PHP and regex, but my code does not work:

$str=preg_replace_callback('/<body>(.*?)<\/body>/s', 
    function($matches){
        return '<body>'.preg_replace(array(
            '/<!--(.|\s)*?-->/',
        ),
        array(
            '',
        ), $matches[1]).'</body>';
    }, $str);

The problem is that nothing happens: the comments stay where they are, and none of the other cleanup is applied. Can you help? Thanks!

EDIT:

Thanks to @mhall, I figured out that my regex did not work because of the attributes in the <body> tag. I used his code and updated it to this:

$str = preg_replace_callback('/(?=<body(.*?)>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[2]);
    }, $str);

This works perfectly!

Thanks people!
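
For anyone who runs into the same thing, here is a minimal sketch of why the first pattern silently did nothing. The sample markup and variable names are made up for illustration, and `<body[^>]*>` is just one assumed way of allowing attributes, not the pattern from the answer:

```php
<?php
// Hypothetical sample input: a <body> tag that carries an attribute.
$html = '<html><body class="home"><!-- note --><p>Hi</p></body></html>';

// The pattern from the first attempt requires the literal text "<body>".
$strict = preg_match('/<body>(.*?)<\/body>/s', $html);

// Allowing any attributes with [^>]* lets the opening tag match.
$loose = preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $m);

var_dump($strict);   // int(0): no match, so the callback never runs
var_dump($loose);    // int(1)
echo $m[1], PHP_EOL; // <!-- note --><p>Hi</p>
```

Because preg_replace_callback returns the subject unchanged when nothing matches, a non-matching pattern looks exactly like "nothing happens".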

Upvotes: 2

Views: 301

Answers (2)

alexis

Reputation: 50190

Aren't you making it too complicated? You don't need to jump in and out via a callback, since preg_replace will make replacements at every match:

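// Split once at the first "<body"; $parts[0] is everything before it (the head).
$parts = explode("<body", $str, 2);
// Strip comments from the body part only.
$clean = preg_replace('/<!--.*?-->/s', '', $parts[1]);
$str = $parts[0]."<body".$clean;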

Splitting the string into head and body excludes the head from substitution without a lot of messy regexps. Note the s after the pattern: '/.../s'. This makes the dot in the regexp match embedded newlines along with other characters.
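
To see what the s modifier changes, here is a small sketch with a made-up string containing one comment that spans two lines (the variable names are only for illustration):

```php
<?php
// Hypothetical sample: one comment that spans two lines.
$body = "before <!-- line one\nline two --> after";

// Without /s the dot stops at the newline, so the comment is not matched.
$without = preg_replace('/<!--.*?-->/', '', $body);

// With /s the dot also matches newlines and the whole comment is removed.
$with = preg_replace('/<!--.*?-->/s', '', $body);

var_dump($without === $body); // bool(true): nothing was replaced
echo $with, PHP_EOL;          // before  after
```

Single-line comments are removed either way; the modifier only matters once a comment contains a line break.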

Upvotes: 0

mhall

Reputation: 3701

Try this. I changed the preg_replace_callback pattern to use lookarounds, so the body tags no longer have to be re-added in the callback, and replaced (.|\s)*? with .*? in preg_replace. I also dropped the array syntax there and added the /s modifier:

$str = <<<EOS
<html>
    <body>
        <p>
             Here is some <!-- One comment --> text
             with a few <!--
                Another comment
             -->
             Comments in it
        </p>
    </body>
</html>
EOS;

$str = preg_replace_callback('/(?=<body>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[1]);
    }, $str);

echo $str, PHP_EOL;

Output:

<html>
    <body>
        <p>
             Here is some  text
             with a few 
             Comments in it
        </p>
    </body>
</html>

Upvotes: 2
