Reputation: 1
I'm trying to scrape comments from a popular news site for an academic study using curl. It works fine for articles with <300 comments but after that it struggles.
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
echo $html; //just to see what's been scraped
At the moment this page works fine: http://www.guardian.co.uk/commentisfree/2012/aug/22/letter-from-india-women-drink?commentpage=all#start-of-comments
But this one only returns 36 comments despite there being 700+ in total: http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape?commentpage=all#start-of-comments
Why is it struggling for articles with a ton of comments?
Upvotes: 0
Views: 380
Reputation: 55962
Your comments page is paginated. Each page contains different comments, so you will have to request all of the comment pagination links.
The parameter commentpage=x
is appended to the URL for each different page.
It might be best to fetch the base page, search it for all links carrying the page parameter, and then request each of those in turn.
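As a rough sketch of that loop in the same curl style as your snippet (the `buildCommentPageUrl()` helper, the page upper bound, and the stop-on-empty-response condition are all assumptions for illustration, not something the site documents):

```php
<?php
// Hypothetical helper: build the URL for a given comment page.
function buildCommentPageUrl($baseUrl, $page) {
    return $baseUrl . '?commentpage=' . $page;
}

$baseUrl = 'http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape';
$allHtml = '';

// 50 is an assumed upper bound; in practice you would stop when the
// response no longer contains comments (or parse the pagination links
// from the first page instead of counting blindly).
for ($page = 1; $page <= 50; $page++) {
    $handle = curl_init(buildCommentPageUrl($baseUrl, $page));
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($handle);
    curl_close($handle);

    if ($html === false || $html === '') {
        break; // request failed or came back empty: assume no more pages
    }
    $allHtml .= $html;
}
```

You would then run your comment-extraction step over `$allHtml` instead of a single page's output.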
As Mike Christensen pointed out, if you could use Python and Scrapy, that functionality is built in. You just have to specify the element the comment is located in, and Scrapy will crawl all the links on the page for you :)
Upvotes: 2