Reputation: 443
I am trying to fetch and echo page data. For about 95% of sites, file_get_contents and cURL work just fine, but for a few sites nothing I have tried works. I set a proper user agent and disabled SSL verification, but neither helped.
Test site where it fails with a Forbidden error: https://norskbymiriams.dk/
wget is also unable to fetch SSL sites, even though it is compiled with SSL support (checked with wget -V).
I tried the following code. Neither snippet worked for this particular site.
file_get_contents
$list_url = "https://norskbymiriams.dk/";
$html = file_get_contents($list_url);
echo $html;
curl
$handle=curl_init('https://norskbymiriams.dk');
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($handle);
echo $content;
Any help would be appreciated.
Upvotes: 1
Views: 649
Reputation: 1892
Some websites inspect incoming requests very closely. If any single detail makes the web server think you are a crawler bot, it may return 403.
I would try this:
Make a request from a browser, look at all of the request headers it sends, and put them in my cURL request (to simulate a real browser).
My cURL request would look like this:
curl 'https://norskbymiriams.dk/' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
  --compressed
Please try it. It works.
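Since the question uses PHP, here is a rough sketch of how the command above might translate to PHP's cURL extension. The function names `browser_like_options` and `fetch_like_browser` are made up for this example, and I am assuming that an empty `CURLOPT_ENCODING` is a fair stand-in for `--compressed`. Whether this is enough for any particular site is not guaranteed.

```php
<?php
// Hypothetical helper (not from the original post): the cURL options that
// mirror the command-line request shown above.
function browser_like_options(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        // Empty string: advertise every encoding this libcurl build supports
        // and decode the response automatically (assumed equivalent of --compressed).
        CURLOPT_ENCODING       => '',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        CURLOPT_HTTPHEADER     => ['Upgrade-Insecure-Requests: 1'],
    ];
}

// Hypothetical helper: perform the request and return the status code,
// the body (false on failure) and any cURL error message.
function fetch_like_browser(string $url): array
{
    $handle = curl_init($url);
    curl_setopt_array($handle, browser_like_options());

    $body   = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $error  = curl_error($handle);

    return ['status' => $status, 'body' => $body, 'error' => $error];
}
```

Calling `fetch_like_browser('https://norskbymiriams.dk/')` and checking the `status` entry before echoing `body` makes it easier to tell a 403 apart from a transport-level failure, where `status` is 0 and `error` is non-empty.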
You can make the request in Chrome, for example, and use the Network tab in Developer Tools to inspect the page request. If you right-click on the request, you will see a Copy as cURL option.
Then test each header separately in your actual cURL request to find out which one is the missing link, add it, and continue your crawling.
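One possible way to do that header-by-header test in PHP is to start from the full header list copied from the browser and repeat the request with one header removed each time. This is only a sketch of that idea: the helper names `header_variants`, `status_for` and `probe_headers` are invented here, and the header list you pass in is whatever your own browser sent.

```php
<?php
// Hypothetical helper: for each header, build the list with that one removed.
// Returns a map of "dropped header" => "remaining headers".
function header_variants(array $headers): array
{
    $variants = [];
    foreach ($headers as $index => $dropped) {
        $rest = $headers;
        unset($rest[$index]);
        $variants[$dropped] = array_values($rest);
    }
    return $variants;
}

// Hypothetical helper: request the URL with the given headers and return the
// HTTP status code (0 if the transfer itself failed).
function status_for(string $url, array $headers): int
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',       // accept and decode compressed responses
        CURLOPT_HTTPHEADER     => $headers,
    ]);
    curl_exec($handle);
    return (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
}

// Hypothetical helper: status with all headers, then with each one dropped.
function probe_headers(string $url, array $headers): array
{
    $results = ['(all headers)' => status_for($url, $headers)];
    foreach (header_variants($headers) as $dropped => $rest) {
        $results["without $dropped"] = status_for($url, $rest);
    }
    return $results;
}
```

For example, `print_r(probe_headers('https://norskbymiriams.dk/', $headers))` with `$headers` set to the lines from Copy as cURL would list one status code per variant. A header whose removal turns a 200 into a 403 is probably one the server checks.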
Upvotes: 1