charlieperry

Reputation: 1

Using curl for scraping large pages

I'm using curl to scrape comments from a popular news site for an academic study. It works fine for articles with fewer than 300 comments, but beyond that it struggles.

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
$html = curl_exec($handle);
curl_close($handle);
echo $html; // just to see what's been scraped

At the moment this page works fine: http://www.guardian.co.uk/commentisfree/2012/aug/22/letter-from-india-women-drink?commentpage=all#start-of-comments

But this one only returns 36 comments despite there being 700+ in total: http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape?commentpage=all#start-of-comments

Why is it struggling for articles with a large number of comments?

Upvotes: 0

Views: 380

Answers (1)

dm03514

Reputation: 55962

Your comments page is paginated, and each page contains different comments, so you will have to request every comment pagination link.

The parameter page=x is appended to the URL to select a different page of comments.

It might work to fetch the base page, search it for all links containing the page parameter, and then request each of those in turn, as sketched below.
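Something along these lines (a rough PHP sketch, assuming the pagination links carry a page=N query parameter; the fetch_page() helper and the regular expression are illustrative and not tested against the Guardian's actual markup):

// Fetch a single URL with curl and return the response body as a string.
function fetch_page($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($handle);
    curl_close($handle);
    return $html;
}

$base = 'http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape';
$html = fetch_page($base . '?commentpage=all');

// Collect every link that carries a page=N parameter.
// This pattern is a guess; inspect the real markup to refine it.
preg_match_all('/href="([^"]*[?&]page=\d+[^"]*)"/', $html, $matches);
$pages = array_unique($matches[1]);

// Request each pagination link in turn and accumulate the markup.
$all_html = $html;
foreach ($pages as $page_url) {
    $all_html .= fetch_page(html_entity_decode($page_url));
}

// $all_html now holds the markup of every comment page; parse the
// comment elements out of it with DOMDocument or a similar HTML parser.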

As Mike Christensen pointed out, if you can use Python and Scrapy, that functionality is built in. You just have to specify the element the comment is located in, and Scrapy will crawl all the links on the page for you. :)

Upvotes: 2
