moey

Reputation: 10887

Preventing a Site from Being Crawled by a Script

I was trying to read a page from the same site using PHP. I came across this good discussion and decided to use the cURL method suggested:

function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch      = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}

//Now get the webpage
$data = get_web_page( "https://www.google.com/" );

//Display the data (optional)
echo "<pre>" . $data['content'] . "</pre>";

So, for my case, I called get_web_page() like this:

$target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";           
$page = get_web_page($target_url);
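
For reference, the error fields the function returns can be inspected like this (a quick sketch; http_code is one of the fields curl_getinfo() provides):

// Inspect cURL's own diagnosis of the request.
if ($page['errno'] !== 0) {
    echo "cURL error {$page['errno']}: {$page['errmsg']}";
} else {
    echo "HTTP status: {$page['http_code']}";
}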

The thing I couldn't fathom is that it worked on all of my test servers but one. I've verified that cURL is available on the server in question. Also, setting `$target_url = "http://www.google.com"` worked fine. So, I am pretty positive that the culprit has nothing to do with the cURL library.

Can it be because some servers block themselves from being "crawled" by this type of script? Or, maybe I just missed something here?

Thanks beforehand.

Upvotes: 1

Views: 256

Answers (3)

moey

Reputation: 10887

It turned out that there's nothing wrong with the above script. And yes, $target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html"; returned the intended value (as questioned by @ajreal in his answer).

The problem was actually with how the IP of the target page was being resolved, which means the answer to this question is related to neither PHP nor Apache: when I ran the script on the server under test, the hostname resolved to an IP address that wasn't accessible. Please refer to this more detailed explanation / discussion.

One takeaway: first try curl -v from the command line; it might give you useful clues.
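
The same check can be done from PHP before blaming cURL (a minimal sketch; gethostbyname() returns the input string unchanged when resolution fails):

// Check how this server resolves the target hostname.
$host = $_SERVER['SERVER_NAME'];
$ip   = gethostbyname($host);   // unchanged hostname back means lookup failed

if ($ip === $host) {
    echo "Could not resolve $host";
} else {
    echo "$host resolves to $ip";
}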

Upvotes: 0

a sad dude

Reputation: 2825

Try using HTTP_HOST instead of SERVER_NAME. They're not quite the same.
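
A quick sketch to see the difference on a given server (SERVER_NAME typically comes from Apache's ServerName directive, depending on the UseCanonicalName setting, while HTTP_HOST is the Host header the client actually sent):

echo $_SERVER['SERVER_NAME'], "\n";  // what the server is configured to call itself
echo $_SERVER['HTTP_HOST'], "\n";    // what the client asked for

$target_url = "http://" . $_SERVER['HTTP_HOST'] . "/press-release/index.html";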

Upvotes: 0

ajreal

Reputation: 47321

$target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";

I'm not sure the above expression actually returns the correct URL for you; this might be the cause of all the problems.

Can it be because some servers block themselves from being "crawled" by this type of script?

Yes, it could be. But I don't have the answer, because you did not include the implementation details. This is your site, so you should be able to check.

In general, I would say this is a bad idea: if you are trying to access another page on the same domain, you can simply do file_get_contents(PATH_TO_FILE.'/press-release/index.html'); (judging by the .html extension, I assume that is a static page).

If that page requires some PHP processing, well, you just need to prepare all the necessary variables ... then require the file.
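
A minimal sketch of both approaches (PATH_TO_FILE is a placeholder; $_SERVER['DOCUMENT_ROOT'] and the .php filename below are assumptions for illustration):

// Static page: read it straight from disk, no HTTP round trip needed.
$html = file_get_contents($_SERVER['DOCUMENT_ROOT'] . '/press-release/index.html');

// Page that needs PHP processing: prepare its variables, then require it,
// capturing the output in a buffer.
$needed_variable = 'value';  // hypothetical: whatever the included page expects
ob_start();
require $_SERVER['DOCUMENT_ROOT'] . '/press-release/page.php';  // hypothetical PHP page
$html = ob_get_clean();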

Upvotes: 2
