Steeve

Reputation: 443

file_get_contents, curl, wget fails with 403 response

I am trying to echo site data. For 95% of sites, file_get_contents and cURL work just fine, but for a few sites nothing works no matter what I try. I tried setting a proper user agent and turning SSL verification off, but nothing worked.

Test site where it fails with 403 Forbidden: https://norskbymiriams.dk/

wget is also unable to fetch SSL sites, even though it is compiled with SSL support (checked with wget -V).

I tried the following code; none of it worked for this particular site.

file_get_contents

$list_url = "https://norskbymiriams.dk/";
$html = file_get_contents($list_url);
echo $html;
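For completeness, the "proper user agent" and "SSL verify to false" attempts with file_get_contents were along these lines, using a stream context (a sketch; the header value is illustrative, and the fetch is commented out since it still returns 403 for this site):

```php
<?php
// Sketch: file_get_contents with a browser-like User-Agent via a stream
// context, and SSL verification disabled. Header value is an example.
$context = stream_context_create([
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\r\n",
    ],
    'ssl' => [
        'verify_peer'      => false,  // the "SSL verify to false" attempt
        'verify_peer_name' => false,
    ],
]);

// Network request (still 403 for this site):
// $html = file_get_contents('https://norskbymiriams.dk/', false, $context);
// echo $html;
```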


curl


$handle = curl_init('https://norskbymiriams.dk');
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($handle);
curl_close($handle);

echo $content;

Any help would be great.

Upvotes: 1

Views: 649

Answers (1)

besciualex

Reputation: 1892

Some websites analyse requests extremely carefully. If even a single thing makes the web server think you are a crawling bot, it may return 403.

I would try this:

  1. Make a request from a browser, look at all the request headers, and place them in your cURL request (to simulate a real browser).

  2. My cURL request would look like this:

curl 'https://norskbymiriams.dk/' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
  --compressed

Please try it; it works.

  1. You can make the request in Chrome, for example, and use the Network tab in Developer Tools to inspect the page request. If you right-click on it, you will see "Copy as cURL".

  2. Then test each header separately in your actual cURL request to find which one is the missing link, add it, and continue your crawling.
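Putting the same idea back into PHP: the browser headers can be passed with CURLOPT_HTTPHEADER. This is a sketch assuming the User-Agent copied from the browser above; the Accept and Accept-Language values are examples of what a browser typically sends, and the exact set a given site requires varies, so test them one by one:

```php
<?php
// Sketch: replicating browser request headers in a PHP cURL request.
// Header values are examples; adjust to match what your browser sends.
function fetchLikeBrowser($url)
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($handle, CURLOPT_ENCODING, '');  // accept any encoding, like --compressed
    curl_setopt($handle, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
    curl_setopt($handle, CURLOPT_HTTPHEADER, [
        'Upgrade-Insecure-Requests: 1',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.9',
    ]);
    $content = curl_exec($handle);
    curl_close($handle);
    return $content;
}

// Usage (network request):
// echo fetchLikeBrowser('https://norskbymiriams.dk/');
```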

Upvotes: 1
