Reputation: 44114

Getting an excerpt from HTML in PHP

I need to get a short excerpt of news items written in HTML to show on my front page. Obviously I can't use something as simple as substr because it might leave tags unclosed or even leave half a tag.

Which is easier:

1. Converting the HTML to decent-looking plain text and taking a piece of that
2. Taking the beginning of the HTML and closing any unclosed tags at the cutoff (will this always look OK?)

And how would I go about implementing the chosen solution?

Upvotes: 4

Answers (6)

vatavale

Reputation: 1620

Sometimes it is better to take, for example, the first two paragraphs, using a regex with capture groups and lazy quantifiers.

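function excerpt_from_html($str) {
    // Capture the first two <p>...</p> blocks; \X with the /u flag also matches newlines
    $re = '/(<p>\X*?<\/p>)\X*?(<p>\X*?<\/p>)/u';
    // Return an empty string when fewer than two plain <p> paragraphs are found
    if (!preg_match($re, $str, $matches)) {
        return '';
    }
    return $matches[1] . $matches[2];
}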

Alternatively, you can capture three or four paragraphs and decide how many of them to show based on the length of the resulting excerpt.
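A minimal sketch of that length-based variant follows. The function name excerpt_paragraphs and its parameters $maxChars and $maxParagraphs are made up for illustration, and the length is counted in bytes of the tag-stripped text, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper: collect leading paragraphs until a text-length budget is used up.
function excerpt_paragraphs(string $html, int $maxChars = 400, int $maxParagraphs = 4): string
{
    // Match <p ...>...</p> blocks; \b keeps <pre> from matching, /s lets . span newlines
    preg_match_all('/<p\b[^>]*>.*?<\/p>/is', $html, $m);

    $out  = '';
    $used = 0;
    foreach (array_slice($m[0], 0, $maxParagraphs) as $p) {
        $len = strlen(strip_tags($p));
        // Always keep the first paragraph; stop once the budget would be exceeded
        if ($out !== '' && $used + $len > $maxChars) {
            break;
        }
        $out  .= $p;
        $used += $len;
    }
    return $out;
}
```

Anything between the paragraphs (images, divs, headings) is dropped, which may or may not be what you want on a front page.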

Upvotes: 0

streetparade

Reputation: 32908

Hello, I guess what you are looking for is called website scraping. You can scrape a website with the PHP Simple HTML DOM Parser library.

Here is how you can scrape Slashdot with it:

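// Load the library (assumes simple_html_dom.php is on the include path)
include 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = array();
foreach($html->find('div.article') as $article) {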
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

Upvotes: 2

33v

Reputation: 113

This reduces the HTML to its first paragraph, shortens it without cutting words, and appends an optional trail.

$excerpt = self::excerpt_paragraph($html, 180);

/**
* excerpt first paragraph from html content
* 
**/
public static function excerpt_paragraph($html, $max_char = 100, $trail='...' )
{
    // temp var to capture the first p tag; .*? with /s also matches paragraphs that contain inline tags
    $matches= array();
    if ( preg_match( '/<p\b[^>]*>.*?<\/p>/is', $html, $matches) )
    {
        // found <p></p>
        $p = strip_tags($matches[0]);
    } else {
        $p = strip_tags($html);
    }
    //shorten without cutting words
    $p = self::short_str($p, $max_char );

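    // remove trailing comma, full stop, colon, semicolon, space, and a dangling article 'a' / 'A'
    // (rtrim with 'aA' in the mask would also eat the last letter of words like "pizza")
    $p = preg_replace('/(?:[,.;:\s]|\s+a)+$/i', '', $p);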

    // return nothing if just spaces or too short
    if (ctype_space($p) || $p=='' || strlen($p)<10) { return ''; }

    return '<p>'.$p.$trail.'</p>';
}
//

/**
* shorten string but not cut words
* 
**/
public static function short_str( $str, $len, $cut = false )
{
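    if ( strlen( $str ) <= $len ) { return $str; }
    $string = substr( $str, 0, $len );
    // cut back to the last space, unless mid-word cuts are allowed or there is no space at all
    $pos = strrpos( $string, ' ' );
    if ( !$cut && $pos !== false ) { $string = substr( $string, 0, $pos ); }
    return $string;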
}
//

Upvotes: 3

Richard Nguyen

Reputation: 1321

I would take the 2nd option if it's important to retain the HTML structure of the original news item.

A simple way to implement this would be to run your fragment through Tidy to close off any unclosed tags. In particular, see the tidy::cleanRepair method.
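Roughly, that could look like the sketch below. It assumes the tidy extension is installed, uses the procedural tidy_repair_string() rather than tidy::cleanRepair, and the function name tidy_excerpt and its $maxBytes parameter are made up for this example. Note that the cutoff counts markup bytes as well as text.

```php
<?php
// Hypothetical helper: cut the raw HTML, then let Tidy close whatever was left open.
function tidy_excerpt(string $html, int $maxBytes = 300): string
{
    if (!function_exists('tidy_repair_string')) {
        throw new RuntimeException('The tidy extension is not available');
    }

    $cut = substr($html, 0, $maxBytes);
    // Drop a tag that was cut in half at the end, e.g. "<a hre"
    $cut = preg_replace('/<[^>]*$/', '', $cut);

    // show-body-only returns just the repaired fragment, without <html>/<head>/<body>
    $repaired = tidy_repair_string($cut, ['show-body-only' => true], 'utf8');

    return $repaired === false ? '' : trim($repaired);
}
```

Whether the result always looks OK depends on the markup: a table or list cut in the middle will be valid after repair but may still look odd.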

Upvotes: 3

cimnine

Reputation: 4067

You could try parsing your data to XML and then truncating only the "pure" text nodes.

Note: This solution forces the input to be valid XML and to be always in about the same structure.
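As a rough illustration of truncating only the text nodes, here is a sketch that swaps strict XML parsing for PHP's more forgiving DOMDocument::loadHTML, so the input does not have to be valid XML. The names dom_excerpt and truncate_node and the wrapper div are assumptions made for this example, not part of any library.

```php
<?php
// Hypothetical helper: walk the tree, spend a character budget on text nodes,
// and remove everything that comes after the budget is used up.
function truncate_node(DOMNode $node, int &$remaining): void
{
    // Copy the child list first because nodes are removed while iterating
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $len = $child->length; // length in characters, not bytes
            if ($len > $remaining) {
                $child->deleteData($remaining, $len - $remaining);
                $remaining = 0;
            } else {
                $remaining -= $len;
            }
        } else {
            truncate_node($child, $remaining);
        }
    }
}

function dom_excerpt(string $html, int $maxChars = 200): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc = new DOMDocument();
    // The XML declaration is a common trick to make libxml read the input as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper added above is the first div in document order
    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $maxChars;
    truncate_node($root, $remaining);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the cut happens inside the parsed tree, tags are closed by the serializer and no half tag can survive. The sketch cuts mid-word, so you may want to back up to the last space before appending a trail.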

Upvotes: 1

Ben James

Reputation: 125207

The simplest way is to strip all HTML from the item text using strip_tags() before truncating it.
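Something along these lines, as a sketch: the function name plain_excerpt and its parameters are made up for illustration, and lengths are counted in bytes, so for multibyte text you would probably want the mb_* string functions instead.

```php
<?php
// Hypothetical helper: strip the markup, then truncate at a word boundary.
function plain_excerpt(string $html, int $maxLen = 200, string $trail = '...'): string
{
    // Remove tags, decode entities, and collapse runs of whitespace
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));

    $flags = ENT_QUOTES | ENT_SUBSTITUTE;
    if (strlen($text) <= $maxLen) {
        return htmlspecialchars($text, $flags, 'UTF-8');
    }

    // Look one byte past the limit so a word that ends exactly at the limit is kept
    $window    = substr($text, 0, $maxLen + 1);
    $lastSpace = strrpos($window, ' ');
    $cut = $lastSpace === false
        ? substr($text, 0, $maxLen)        // no space at all: hard cut
        : substr($window, 0, $lastSpace);  // back up to the last word boundary

    // Re-escape because the result is going back into an HTML page
    return htmlspecialchars($cut, $flags, 'UTF-8') . $trail;
}

echo plain_excerpt('<p>Hello <b>brave</b> new world</p>', 11), "\n"; // prints "Hello brave..."
```

You lose links and emphasis this way, but the output can never contain a broken tag.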

Upvotes: 8
