Reputation: 83
file_get_contents() returns the proper file contents for www.akaar.org but not for www.ptsda.org.
The main difference is that akaar.org is a PHP project and ptsda.org is plain HTML.
Basically, I am building a web crawler in PHP. It failed to crawl that particular site, even though it has successfully crawled 150+ other sites.
Upvotes: 0
Views: 646
Reputation: 83
Finally found the solution.
I saved the page as an HTML file and gave that file as input to my PHP crawler.
<?php
$contents = file_get_contents("The downloaded HTML file");
print_r($contents);
?>
SUCCESS :)
Thanks to all for replying.
Upvotes: 0
Reputation: 1450
Here is the reason why certain websites do not allow crawling.
file_get_contents('http://www.akaar.org/')
With this call you get a result from the website, which means the server hosting it has no firewall configured to block crawl requests.
file_get_contents('http://www.ptsda.org/')
In this case you will get
HTTP request failed! HTTP/1.1 403 ModSecurity
as output, which means the server is configured with a firewall and you won't get the response. Read more about ModSecurity.
Here is the solution: try to use cURL instead of file_get_contents. Note: this is a workaround.
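<?php
// Workaround: request the page with cURL and send an explicit User-Agent string.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.ptsda.org/');
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);   // stop trying to connect after 2 seconds
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'ptsda');  // identify the client to the server
$query = curl_exec($curl_handle);                       // page contents, or false on failure
curl_close($curl_handle);
//print_r($query);
?>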
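If you want the crawler to notice when a site refuses the request rather than silently getting an empty result, a minimal sketch along these lines might help. It assumes the cURL extension is available; fetchWithStatus(), isRefused() and the 'MyCrawler/1.0' agent string are hypothetical names chosen for illustration, not part of the original code.

```php
<?php
// Sketch only: fetch a URL with cURL and report the HTTP status alongside the body.
// fetchWithStatus() and isRefused() are hypothetical helper names.
function fetchWithStatus(string $url, string $userAgent): array
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($handle, CURLOPT_USERAGENT, $userAgent);

    $body   = curl_exec($handle);                         // false on transport failure
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);  // 0 if no response was received

    return [
        'status' => (int) $status,
        'body'   => $body === false ? null : $body,
    ];
}

// True when the server answered but refused the request (for example a 403).
function isRefused(int $status): bool
{
    return $status === 401 || $status === 403;
}

// Example use (performs a network request, so it is left commented out):
// $result = fetchWithStatus('http://www.ptsda.org/', 'MyCrawler/1.0');
// if (isRefused($result['status'])) {
//     echo "Blocked with HTTP {$result['status']}\n";
// }
```

Checking the status this way lets the crawler log or skip blocked sites instead of treating them like empty pages.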
Upvotes: 2
Reputation: 85
Your problem is that the host of ptsda.org is returning this 403 (Forbidden) error:
file_get_contents("http://www.ptsda.org"): failed to open stream: HTTP request failed! HTTP/1.1 403 ModSecurity
This shows they have protection in place to stop their content from being crawled by bots. You might be able to bypass this by setting a user-agent string within PHP (see this question).
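As a rough sketch of what setting the user agent could look like with file_get_contents(), you can pass a stream context. The helper name buildContextOptions() and the agent string 'MyCrawler/1.0' are assumptions made for this example, and there is no guarantee that any particular agent string will get past the site's rules.

```php
<?php
// Hypothetical helper: builds HTTP stream-context options that send a User-Agent.
function buildContextOptions(string $userAgent, int $timeoutSeconds = 5): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,       // sent as the User-Agent request header
            'timeout'       => $timeoutSeconds,  // read timeout in seconds
            'ignore_errors' => true,             // still return the body on 4xx/5xx responses
        ],
    ];
}

$context = stream_context_create(buildContextOptions('MyCrawler/1.0'));

// Example use (performs a network request, so it is left commented out):
// $html = file_get_contents('http://www.ptsda.org/', false, $context);
```

The same effect can also be had globally with ini_set('user_agent', ...), but a per-request context keeps the setting local to the crawler's own calls.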
Upvotes: 2
Reputation: 1062
ptsda.org is returning this 403 (forbidden) error:
failed to open stream: HTTP request failed! HTTP/1.1 403 ModSecurity Action
So it looks like they have Apache ModSecurity protection in place to stop their content from being scraped in this way.
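To let a crawler recognise this situation in code, one option is to pull the status code out of the status line. The following is only a sketch: statusCodeFromLine() is a hypothetical helper, and the commented-out get_headers() call is just one assumed way to obtain the status line.

```php
<?php
// Hypothetical helper: extract the numeric status code from an HTTP status line
// (or from a warning message that contains one). Returns null if none is found.
function statusCodeFromLine(string $statusLine): ?int
{
    if (preg_match('#HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $matches) === 1) {
        return (int) $matches[1];
    }
    return null;
}

// Example use (performs a network request, so it is left commented out):
// $headers = get_headers('http://www.ptsda.org/');
// $code = $headers === false ? null : statusCodeFromLine($headers[0]);
// if ($code === 403) {
//     echo "The server refused the request\n";
// }
```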
Upvotes: 2
Reputation: 9381
It is a Flash site, which cannot be crawled as easily as an HTML site.
Upvotes: 1