Reputation: 1830
For some reason I can't seem to get this particular web page's contents via cURL. I've managed to use cURL to get to the "top level page" contents fine, but the same self-built quick cURL function doesn't seem to work for one of the linked off sub web pages.
Top level page: http://www.deindeal.ch/
A sub page: http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/
My cURL function (in functions.php)
function curl_get($url) {
$ch = curl_init();
$header = array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Accept-Language: en-us;q=0.8,en;q=0.6'
);
$options = array(
CURLOPT_URL => $url,
CURLOPT_HEADER => 0,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
CURLOPT_HTTPHEADER => $header
);
curl_setopt_array($ch, $options);
$return = curl_exec($ch);
curl_close($ch);
return $return;
}
PHP file to get the contents (using echo for testing)
require "functions.php";
require "phpQuery.php";
echo curl_get('http://www.deindeal.ch/deals/hotel-walliserhof-zermatt-2-naechte-30/');
So far I've attempted the following to get this to work
curl_get()
contained all the options as current, except for CURLOPT_USERAGENTand
CURLOPT_HTTPHEADERS`.Is it possible for a website to completely block requests via cURL or other remote file opening mechanisms, regardless of how much data is supplied to attempt to make a real browser request?
Also, is it possible to diagnose why my requests are turning up with nothing?
Any help answering the above two questions, or editing/making suggestions to get the file's contents, even if through a method different than cURL would be greatly appreciated ;).
Upvotes: 3
Views: 7846
Reputation: 132028
Try adding:
CURLOPT_FOLLOWLOCATION => TRUE
to your options.
If you run a simple curl request from the command line (including a -i
to see the response headers) then it is pretty easy to see:
$ curl -i 'http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/'
HTTP/1.1 302 FOUND
Date: Fri, 30 Dec 2011 02:42:54 GMT
Server: Apache/2.2.16 (Debian)
Vary: Accept-Language,Cookie,Accept-Encoding
Content-Language: de
Set-Cookie: csrftoken=d127d2de73fb3bd72e8986daeca86711; Domain=www.deindeal.ch; Max-Age=31449600; Path=/
Set-Cookie: generic_cookie=1; Path=/
Set-Cookie: sessionid=987b1a11224ecd0e009175470cf7317b; expires=Fri, 27-Jan-2012 02:42:54 GMT; Max-Age=2419200; Path=/
Location: http://www.deindeal.ch/welcome/?deal_slug=hotel-cristal-in-nuernberg-30
Content-Length: 0
Connection: close
Content-Type: text/html; charset=utf-8
As you can see, it returns a 302 with a Location header. If you hit that location directly, you will get the content you are looking for.
And to answer your two questions:
EDIT
Ah, I see what you are talking about now. So, when you go to that link for the first time you are redirected and a cookie (or cookies) is set. Once you have those cookie, your request goes through as intended.
So, you need to use a cookiejar, like in this example: http://icfun.blogspot.com/2009/04/php-how-to-use-cookie-jar-with-curl.html
So, you will need to make an initial request, save the cookies, and make your subsequent requests including the cookies after that.
Upvotes: 5