Reputation: 44114
I need to get a short excerpt of news items written in HTML to show on my front page. Obviously I can't use something as simple as substr, because it might leave tags unclosed or even cut a tag in half.
Which is easier:
And how would I go about implementing the chosen solution?
Upvotes: 4
Views: 3898
Reputation: 1620
Sometimes it's better to take, for example, the first two paragraphs, using a regex with groups and lazy quantifiers.
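function excerpt_from_html($str) {
    // Capture the first two <p>...</p> blocks; \X*? lazily skips anything between them
    $re = '/(<p>\X*?<\/p>)\X*?(<p>\X*?<\/p>)/u';
    // Return an empty string when the text has fewer than two plain <p> paragraphs
    if (!preg_match($re, $str, $matches)) {
        return '';
    }
    return $matches[1] . $matches[2];
}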
Or you can take 3-4 paragraphs and decide how many of them to show based on the length of the excerpt.
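A minimal sketch of that length-based variant might look like the following. The function name excerpt_paragraphs and its parameters $maxChars and $maxParagraphs are made up for illustration, and the length is counted in bytes of visible text, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper: collect leading <p> blocks until a text-length budget is reached.
function excerpt_paragraphs(string $html, int $maxChars = 300, int $maxParagraphs = 4): string
{
    // Match whole <p>...</p> blocks, including ones with attributes or inline tags.
    if (!preg_match_all('/<p\b[^>]*>.*?<\/p>/s', $html, $matches)) {
        return '';
    }

    $excerpt = '';
    $length = 0;
    foreach (array_slice($matches[0], 0, $maxParagraphs) as $paragraph) {
        // Measure visible text only, so markup does not count against the budget.
        $length += strlen(strip_tags($paragraph));
        // Always keep the first paragraph; stop once the budget would be exceeded.
        if ($excerpt !== '' && $length > $maxChars) {
            break;
        }
        $excerpt .= $paragraph;
    }
    return $excerpt;
}
```

Because only complete paragraphs are kept, every tag in the result stays closed, at the cost of an excerpt whose length varies with the paragraph sizes.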
Upvotes: 0
Reputation: 32908
I guess what you are looking for is called website scraping. One way to scrape a website is with a library such as PHP Simple HTML DOM Parser, which you need to download first.
Here is how you can scrape Slashdot with it:
// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
Upvotes: 2
Reputation: 113
This reduces the HTML to its first paragraph, shortens it without cutting words, and appends an optional trail.
$excerpt = self::excerpt_paragraph($html, 180);
/**
* excerpt first paragraph from html content
*
**/
public static function excerpt_paragraph($html, $max_char = 100, $trail = '...')
{
    // temp var to capture the p tag(s)
    $matches = array();
    // match the first <p>...</p>, allowing attributes and inline tags inside it
    if (preg_match('/<p\b[^>]*>.*?<\/p>/s', $html, $matches)) {
        // found <p></p>
        $p = strip_tags($matches[0]);
    } else {
        $p = strip_tags($html);
    }
    // shorten without cutting words
    $p = self::short_str($p, $max_char);
    // remove trailing comma, full stop, colon, semicolon, 'a', 'A', space
    $p = rtrim($p, ',.;: aA');
    // return nothing if just spaces or too short
    if (ctype_space($p) || $p == '' || strlen($p) < 10) { return ''; }
    return '<p>' . $p . $trail . '</p>';
}
/**
* shorten string but not cut words
*
**/
public static function short_str($str, $len, $cut = false)
{
    if (strlen($str) <= $len) { return $str; }
    $string = substr($str, 0, $len);
    // back up to the last space, unless a hard cut was requested or there is no space
    $space = strrpos($string, ' ');
    if (!$cut && $space !== false) { $string = substr($string, 0, $space); }
    return $string;
}
Upvotes: 3
Reputation: 1321
I would take the 2nd option if it's important to retain the HTML structure of the original news item.
A simple way to implement this would be to run your fragment through Tidy to close off any unclosed tags. In particular, see the tidy::cleanRepair method.
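As a rough sketch of that idea, assuming the PHP tidy extension is installed: cut the string, drop any half-finished tag at the end, and let Tidy close what is left open. The function name truncate_html_with_tidy and the $maxBytes parameter are made up for illustration, and note that substr counts bytes, markup included.

```php
<?php
// Hypothetical helper: cut an HTML string, then let Tidy close whatever was left open.
// Assumes the PHP tidy extension is available.
function truncate_html_with_tidy(string $html, int $maxBytes = 200): string
{
    $fragment = substr($html, 0, $maxBytes);
    // Drop a tag that was cut in half at the end, e.g. "<a hre".
    $fragment = (string) preg_replace('/<[^>]*$/', '', $fragment);

    $tidy = new tidy();
    // show-body-only asks Tidy for just the repaired fragment, not a full HTML document.
    $tidy->parseString($fragment, array('show-body-only' => true), 'utf8');
    $tidy->cleanRepair();

    return trim(tidy_get_output($tidy));
}
```

Since the byte limit includes the tags themselves, the visible text may end up noticeably shorter than the limit on markup-heavy items.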
Upvotes: 3
Reputation: 4067
You could try parsing your data to XML and then truncating only the "pure" text nodes.
Note: This solution forces the input to be valid XML and to be always in about the same structure.
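One possible way to sketch this uses PHP's DOM extension. The function name truncate_xml_text and the $maxChars parameter are hypothetical; the input is assumed to be a well-formed XML fragment, and lengths are counted in bytes, so multibyte text would need mb_substr instead.

```php
<?php
// Hypothetical helper: parse the item as XML and shorten only its text nodes.
function truncate_xml_text(string $xhtml, int $maxChars = 200): string
{
    $doc = new DOMDocument();
    // Wrap in a root element so a fragment with several top-level nodes still parses.
    if (!@$doc->loadXML('<div>' . $xhtml . '</div>')) {
        return ''; // input was not well-formed XML
    }

    $remaining = $maxChars;
    $xpath = new DOMXPath($doc);
    // Copy the text nodes (in document order) into an array, because the loop removes nodes.
    foreach (iterator_to_array($xpath->query('//text()')) as $text) {
        if ($remaining <= 0) {
            // Budget used up: drop all later text.
            $text->parentNode->removeChild($text);
            continue;
        }
        $length = strlen($text->data);
        if ($length > $remaining) {
            $text->data = substr($text->data, 0, $remaining);
        }
        $remaining -= min($length, $remaining);
    }

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}
```

Elements that come after the cut-off point are kept but emptied, so a real implementation would probably also want to remove empty elements afterwards.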
Upvotes: 1
Reputation: 125207
The simplest way is to strip all HTML from the item text using strip_tags() before truncating it.
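A minimal sketch of that approach, cutting at a word boundary so no word is split, could look like this. The function name plain_text_excerpt and its parameters are made up for illustration, and lengths are counted in bytes.

```php
<?php
// Hypothetical helper: plain-text excerpt cut at a word boundary.
function plain_text_excerpt(string $html, int $maxChars = 200, string $trail = '...'): string
{
    // Remove markup, decode entities, and collapse runs of whitespace left behind by tags.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $text = trim((string) preg_replace('/\s+/', ' ', $text));

    if (strlen($text) <= $maxChars) {
        return $text;
    }

    $cut = substr($text, 0, $maxChars);
    // Back up to the last space so a word is not split; keep the hard cut if there is none.
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . $trail;
}
```

The result is plain text with entities decoded, so it should go through htmlspecialchars() before being printed back into a page.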
Upvotes: 8