Arun Kumar

Reputation: 83

file_get_contents returns nothing on html input

file_get_contents() returns proper file contents on www.akaar.org but not on www.ptsda.org.

The main difference is that akaar.org is a PHP project and ptsda.org is plain HTML.

I am building a web crawler in PHP. It failed on this particular site, even though it has successfully crawled more than 150 other sites.

Upvotes: 0

Views: 646

Answers (5)

Arun Kumar

Reputation: 83

Finally found the solution.

I saved the page as an HTML file and passed that file to my PHP crawler.

<?php
    // Read the locally saved copy of the page instead of fetching it over HTTP.
    $contents = file_get_contents("The downloaded HTML file");
    print_r($contents);
?>

SUCCESS :)

Thanks to all for replying.

Upvotes: 0

Nikhil

Reputation: 1450

Here is the reason why certain websites cannot be crawled this way.

  1. file_get_contents('http://www.akaar.org/') returns the page, which means the server hosting this site has no firewall configured to block crawler requests.
  2. file_get_contents('http://www.ptsda.org/') fails with "HTTP request failed! HTTP/1.1 403 ModSecurity", which means the server sits behind a firewall that refuses the request, so you get no response body. Read more about ModSecurity.

The solution is to use cURL instead of file_get_contents. Note: this is a workaround.

<?php
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, 'http://www.ptsda.org/');
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);  // give up connecting after 2 seconds
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'ptsda'); // send an explicit User-Agent header
    $query = curl_exec($curl_handle);                      // page body, or false on failure
    curl_close($curl_handle);
    //print_r($query);
?>
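
If you also want to know why a request failed, a variation like the following might help. It is only a sketch: the function name fetchWithCurl, the timeout values and the returned array shape are my own choices, not anything the site requires. It reports the HTTP status code next to the body, so a 403 from the firewall is visible instead of looking like an empty page.

```php
<?php
// Hypothetical helper (name and return shape are illustrative only):
// fetches a URL with cURL and returns the HTTP status, body and error text.
function fetchWithCurl(string $url, string $userAgent): array
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,     // assumed value, seconds to connect
        CURLOPT_TIMEOUT        => 15,    // assumed value, seconds for the whole transfer
        CURLOPT_USERAGENT      => $userAgent,
    ]);

    $body   = curl_exec($handle);                         // string, or false on failure
    $error  = curl_error($handle);                        // empty string when there was no error
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);  // 0 if no response was received

    // The handle is released when it goes out of scope at function return.
    return [
        'status' => $status,
        'body'   => $body === false ? null : $body,
        'error'  => $body === false ? $error : null,
    ];
}

// Usage (performs a real network request, so it is left commented out):
// $result = fetchWithCurl('http://www.ptsda.org/', 'ptsda');
// if ($result['status'] === 403) { echo "Blocked by the server\n"; }
```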

Upvotes: 2

phonetic

Reputation: 85

Your problem is that the host of ptsda.org is returning this 403 (Forbidden) error:

file_get_contents("http://www.ptsda.org"): failed to open stream: HTTP request failed! HTTP/1.1 403 ModSecurity

This shows they have protection in place to stop bots from crawling their content. You might be able to bypass it by setting a user-agent string within PHP (see this question).
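
A minimal sketch of the user-agent approach with file_get_contents, assuming the firewall rule is triggered by a missing or default User-Agent header (I have not confirmed that this is what ptsda.org checks). The helper name buildCrawlerContext and the agent string MyCrawler/1.0 are made up for the example.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests
// carry an explicit User-Agent header.
function buildCrawlerContext(string $userAgent, float $timeoutSeconds = 5.0)
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,      // sent as the User-Agent header
            'timeout'       => $timeoutSeconds, // read timeout in seconds
            'ignore_errors' => true,            // return the body even on 4xx/5xx
        ],
    ]);
}

// Usage (performs a real network request, so it is left commented out):
// $context = buildCrawlerContext('MyCrawler/1.0');
// $html = file_get_contents('http://www.ptsda.org/', false, $context);
```

If the server still answers 403 with a user agent set, the block is based on something else and this will not help.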

Upvotes: 2

danjam

Reputation: 1062

ptsda.org is returning this 403 (forbidden) error:

failed to open stream: HTTP request failed! HTTP/1.1 403 ModSecurity Action

So it looks like they have Apache ModSecurity protection in place to stop their content from being scraped in this way.
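
To detect this case in a crawler rather than just getting an empty result, you can look at the status line of the response. The sketch below assumes you read the header lines from the $http_response_header variable that PHP's HTTP wrapper fills in after file_get_contents; the function name finalStatusCode is hypothetical.

```php
<?php
// Hypothetical helper: extracts the final HTTP status code from a list of
// raw response header lines, such as the $http_response_header array.
function finalStatusCode(array $headerLines): ?int
{
    $status = null;
    foreach ($headerLines as $line) {
        // Every response in a redirect chain starts with its own status line,
        // so keep overwriting until the last one has been seen.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $matches) === 1) {
            $status = (int) $matches[1];
        }
    }
    return $status; // null when no status line was found
}

// Usage (performs a real network request, so it is left commented out):
// $context = stream_context_create(['http' => ['ignore_errors' => true]]);
// $body = file_get_contents('http://www.ptsda.org/', false, $context);
// if (finalStatusCode($http_response_header ?? []) === 403) {
//     echo "The server refused the request\n";
// }
```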

Upvotes: 2

online Thomas

Reputation: 9381

http://www.ptsda.org/ is a Flash site, which cannot be crawled as easily as plain HTML.

Upvotes: 1
