absk

Reputation: 673

file_get_contents returns 403 forbidden

I am trying to build a site scraper. I made it on my local machine and it works fine there. When I execute the same code on my server, I get a 403 Forbidden error. I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:

Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40

The line of code triggering it is:

$url="http://www.example.com/viewProperty.html?id=".$id;

$html=file_get_html($url);

I have checked php.ini on the server and allow_url_fopen is On. A possible solution could be to use cURL, but I need to know where I am going wrong.

Upvotes: 44

Views: 145922

Answers (12)

Ikari

Reputation: 3246

PHP provides some means to debug such errors, namely

  • a special $http_response_header variable, which is populated with the response HTTP headers after each file_get_contents() call,
  • and the ignore_errors context option. By setting it, you'll get the actual response body, which will likely explain why you are getting a 403 response.
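Both can be combined in a short sketch (the URL is the one from the question; treat it as a placeholder):

```php
<?php
// ignore_errors makes file_get_contents() return the response body
// even when the server answers with a 4xx/5xx status code.
$context = stream_context_create([
    'http' => ['ignore_errors' => true],
]);

$body = @file_get_contents('http://example.com/viewProperty.html?id=7715888', false, $context);

// After the call, $http_response_header holds the raw response headers;
// the first element is the status line, e.g. "HTTP/1.1 403 Forbidden".
if (isset($http_response_header)) {
    echo $http_response_header[0], "\n";
}
```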

Refer to this answer for the details.

From a practical point of view, most likely your request lacks some required HTTP header. For example, it could be Referer or User-Agent.

Common user-agent strings used by browsers are listed below:

  • Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36

  • Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0

  • etc...


$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

echo file_get_contents("https://www.google.com", false, $context);

This code fakes the user agent and sends the request to https://www.google.com.

Cheers!

Upvotes: 89

namal

Reputation: 1284

Test your API endpoint with Postman; it will provide more details about the error. In my case, it said the referer was empty, so you can add a Referer header to the request.

cURL example:

curl_setopt(
  $handle,
  CURLOPT_HTTPHEADER,
    [
      'Content-Type: application/json',
      'Content-Length: ' . strlen($data_string),
      'Referer: https://test.com'
    ]
);

file_get_contents example:

$header = array(
  "Content-Type: application/x-www-form-urlencoded",
  "Referer: https://test.com",
);
$opts = array('http' =>
  array(
    'method' => 'POST',
    'header' => implode("\r\n", $header),
    'content' => $postdata
  )
);
$context = stream_context_create($opts);
$result = file_get_contents($url, false, $context);

This happened to me on the Google API with restricted mode enabled, so another solution is to remove the restrictions.

(Screenshot of the Google API key restriction settings.)

Upvotes: 0

Vijay Richards

Reputation: 111

Add this after you include simple_html_dom.php:

ini_set('user_agent', 'My-Application/2.5');

Upvotes: 11

Daniel Renteria

Reputation: 375

You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was visit the website in a browser and copy any extra information that was sent in the HTTP request.

$context = stream_context_create(
    array(
        "http" => array(
            'method' => "GET",
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
                        "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
                        "Accept-Language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n"
                        // Note: omit "Accept-Encoding: gzip" unless you decompress the response yourself.
        )
    )
);

Upvotes: 1

sac

Reputation: 97

If you use file_get_contents, use the code below:

$context = stream_context_create(
  array(
    "http" => array(
      "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    )
));
$html = file_get_contents($url, false, $context);

If you use cURL:

curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');

Upvotes: 0

Steven

Reputation: 1243

In my case, the server was rejecting the HTTP/1.0 protocol via its .htaccess configuration. file_get_contents uses HTTP/1.0 by default.
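If that's the cause, you can ask the http wrapper for HTTP/1.1 via the stream context (a sketch; the URL is the question's placeholder):

```php
<?php
// file_get_contents() speaks HTTP/1.0 unless protocol_version is set.
$context = stream_context_create([
    'http' => [
        'protocol_version' => 1.1,
        // HTTP/1.1 keeps connections alive by default; close them
        // so the call does not hang waiting for the server.
        'header'           => "Connection: close\r\n",
    ],
]);

$html = @file_get_contents('http://www.example.com/viewProperty.html?id=7715888', false, $context);
```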

Upvotes: 0

CrookedCreek

Reputation: 31

I realize this is an old question, but...

I was just setting up my local sandbox on Linux with PHP 7 and ran across this. When you run scripts from the terminal, PHP uses the CLI php.ini. I found that the user_agent option was commented out there; I uncommented it and added a Mozilla user agent, and now it works.
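For reference, the relevant line in the CLI php.ini looks something like this (the exact file path and the user-agent string vary by system; both are illustrative):

```ini
; e.g. /etc/php/7.x/cli/php.ini — uncomment and set a browser-like value
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
```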

Upvotes: 2

r0adtr1p

Reputation: 61

Write this in simple_html_dom.php; it worked for me:

function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
}

Upvotes: 3

Andrea Syd Coi

Reputation: 11

Did you check the permissions on the file? I set 777 on my file (in localhost, obviously) and that fixed the problem.

Upvotes: 1

Sergi

Reputation: 1256

It seems that the remote server has some kind of blocking in place. It may be by user agent; if that's the case, you can try using cURL to simulate a web browser's user agent like this:

$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);

Upvotes: 7

Dejan Marjanović

Reputation: 19380

You can change it like this in the parser class, from line 35 on:

function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

function file_get_html()
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
}

Have you tried another site?

Upvotes: 6

Pekka

Reputation: 449783

This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.

It could be that it blocks PHP scripts to prevent scraping, or that it blocks your IP because you have made too many requests.

You should probably talk to the administrator of the remote server.

Upvotes: 23
