user2129623

Removing title content from page HTML

I am building a preview for a URL. It should show:

  1. URL title
  2. URL description (the title should not appear in it)

Here is my attempt.

<?php
function plaintext($html)
{
    $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);

    // remove title
    //$plaintext = preg_match('#<title>(.*?)</title>#', $html);

    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);

    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style tags
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove br tags (missed by strip_tags)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

function get_title($html)
{
    return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}

function trim_display($size, $string)
{
    $trim_string = substr($string, 0, $size);

    $trim_string = $trim_string . "...";
    return $trim_string;
}

$url = "http://www.nextbigwhat.com/indian-startups/";
$data = file_get_contents($url);
//$url = trim_url(5,$url);
$title = get_title($data);
echo "title is ; $title";
$content = plaintext($data);
$Preview = trim_display(100, $content);
echo '<br/>';
echo "preview is: $Preview";

?>

The URL title appears correctly. But even though I exclude the title content from the description, it still shows up there.

I used $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html); to exclude the title from the plain text.

As far as I can tell the regex is correct, yet it does not exclude the title content.

What is the problem here?

The output I get is:

title is ; Indian Startups Archives - NextBigWhat.com
preview is: Indian Startups Archives : NextBigWhat.com [whatever rest text]...

The text that appears in the title should not appear again in the preview. That's why I want to exclude it and display only the remaining text in the preview.

Answers (1)

martriay

How to solve the mystery

If you look closely at the title and the preview, they're different. Let's look at the fetched output.

echo plaintext($data);

Well, it seems it has two titles:

<title>
Indian Startups Archives : NextBigWhat.com</title>

and

<title>Indian Startups Archives - NextBigWhat.com</title>

So the get_title function is retrieving the second title, while plaintext removes that one and leaves the first one alone. What's the difference between them? The line break! By default, `.` does not match newline characters, so your regex isn't matching titles that contain one. That is exactly why the s (PCRE_DOTALL) modifier exists in regular expressions.
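To see the effect in isolation, here is a minimal sketch. The `$html` string is a made-up sample that mirrors the two titles shown above, not the real page, and the pattern is a simplified non-greedy one used only for counting matches.

```php
<?php
// Hypothetical sample markup mirroring the two <title> elements above.
$html = "<title>\nIndian Startups Archives : NextBigWhat.com</title>\n"
      . "<title>Indian Startups Archives - NextBigWhat.com</title>";

// Without the s modifier, "." stops at a newline,
// so only the single-line title can match.
preg_match_all('#<title>(.*?)</title>#i', $html, $withoutS);

// With the s modifier, "." also matches newlines,
// so the title that starts with a line break matches too.
preg_match_all('#<title>(.*?)</title>#is', $html, $withS);

echo count($withoutS[1]), "\n"; // 1
echo count($withS[1]), "\n";    // 2
```

The first count covers only the single-line title; the second covers both.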

tl;dr

Your regex is missing the 's' modifier. Use

$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

instead of

$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
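One caveat worth checking on a page with more than one title: with the s modifier, the greedy `(.*)` can run from the first `<title` to the last `</title>` and swallow whatever sits between them. A possible variant, sketched below, uses a non-greedy `.*?` so each title is removed on its own. The helper name `strip_titles` and the sample `$html` are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper, not from the original code.
function strip_titles($html)
{
    // i: case-insensitive, s: "." matches newlines,
    // .*? stops at the nearest closing </title>.
    return preg_replace('#<title\b[^>]*>.*?</title>#is', ' ', $html);
}

// Made-up sample with body text between two titles.
$html = "<title>\nFirst</title><p>Body text</p><title>Second</title>";

// Greedy pattern with s: matches from the first <title to the last </title>.
$greedy = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

// Non-greedy pattern: removes each title separately.
$lazy = strip_titles($html);

var_dump(trim(strip_tags($greedy))); // string(0) ""
var_dump(trim(strip_tags($lazy)));   // string(9) "Body text"
```

If the page only ever has one title, the two patterns behave the same, so this matters mainly for markup like the page in the question.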
