Reputation: 21381

How to extract the headline and content from a crawled web page / article?

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.

Upvotes: 1

Answers (1)

Pekka

Reputation: 449515

You could try the Simple HTML DOM Parser. It offers a jQuery-like selector syntax for finding specific elements.

They have an example of how to scrape Slashdot:

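// Load the library (path is an assumption; adjust to where you saved it)
require_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = [];
foreach ($html->find('div.article') as $article) {
    // Start with a fresh array so values from a previous block cannot leak in
    $item = [];
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);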
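
The Slashdot example relies on class names specific to that site. For arbitrary pages you will probably need fallbacks. Below is a minimal sketch of one possible heuristic, not a definitive implementation: take the headline from `og:title`, then the first `h1`, then `title`, and take the content from the paragraphs inside an `article` element, falling back to all paragraphs. To keep it self-contained it uses PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) instead of Simple HTML DOM. The function name `extractArticle`, the priority order, and the sample markup are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: guess the headline and main text of an HTML page.
// Uses PHP's built-in DOM extension rather than Simple HTML DOM.
function extractArticle(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is often invalid; collect parse errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Headline candidates, most specific first (an assumed priority order).
    $headlineQueries = [
        '//meta[@property="og:title"]/@content',
        '//h1',
        '//title',
    ];
    $headline = '';
    foreach ($headlineQueries as $query) {
        $node = $xpath->query($query)->item(0);
        if ($node !== null && trim($node->textContent) !== '') {
            $headline = trim($node->textContent);
            break;
        }
    }

    // Content: prefer paragraphs inside <article>, otherwise fall back to every <p>.
    $paragraphs = $xpath->query('//article//p');
    if ($paragraphs->length === 0) {
        $paragraphs = $xpath->query('//p');
    }
    $content = [];
    foreach ($paragraphs as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            $content[] = $text;
        }
    }

    return ['headline' => $headline, 'content' => implode("\n\n", $content)];
}

// Usage with a small made-up page.
$page = '<html><head><title>Site name</title></head><body>'
      . '<h1>Example headline</h1>'
      . '<article><p>First paragraph.</p><p>Second paragraph.</p></article>'
      . '</body></html>';
print_r(extractArticle($page));
```

Pages that have neither an `article` element nor meaningful `p` tags will still need site-specific selectors like the ones in the Slashdot example.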

Upvotes: 1
