Reputation: 91
I'm writing a scraper using cURL, and I've found that many pages return several sets of redirect headers, like:
HTTP/1.1 302 Moved Temporarily
Server: nginx/1.0.4
Date: Thu, 17 Nov 2011 17:46:35 GMT
Transfer-Encoding: chunked
Location: http://secure.domain.net/track/NDg6MTE6MTU/?autocamp=TJ_ABC_VA_A02
HTTP/1.1 302 Found
Date: Thu, 17 Nov 2011 17:46:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: nats_cookie=Bookmark; expires=Fri, 18-Nov-2011 17:46:37 GMT; path=/; domain=domain.net
Set-Cookie: nats=MjYwNjk6MTE6MTU%2C0%2C0%2C0%2C0; expires=Sun, 27-Nov-2011 17:46:37 GMT; path=/; domain=domain.net
Set-Cookie: nats_sess=00e48c685c9acbb37fcc3b7461b1ab81; expires=Sat, 25-Feb-2012 17:46:37 GMT; path=/; domain=domain.net
Location: http://www.domain.net/tour/?nats=MjYwNjk6MTE6MTU,0,0,0,0&autocamp=TJ_ABC_VA_A02
Transfer-Encoding: chunked
Content-Type: text/html
HTTP/1.1 200 OK
Date: Thu, 17 Nov 2011 17:46:39 GMT
Server: Apache
Transfer-Encoding: chunked
Content-Type: text/html
As you can see, there are two headers with the "Location:" directive.
I'm just wondering why they do this. Wouldn't it be enough to include only one header?
The redirect URLs are even different, so which one is the "real" landing page?
Thanks.
Upvotes: 1
Views: 1777
Reputation: 1206
When CURLOPT_FOLLOWLOCATION and CURLOPT_HEADER are both true and one or more redirects have happened, the response returned by curl_exec() will contain the headers of every response in the redirect chain, in the order they were encountered.
Source: http://php.net/manual/en/function.curl-setopt.php#103232
In addition, if a response body is returned anywhere in the redirect chain, it will also be included in the return value of curl_exec().
So you can receive something like:
HEADER 1
HEADER 2
BODY 2
or
HEADER 1
HEADER 2
BODY 2
HEADER 3
BODY 3
Keep this in mind if you only want the response headers and body from the last redirect: you need to strip the headers and bodies of the earlier redirects yourself.
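The stripping step can be sketched roughly as follows. This is only an illustration in Python, not part of cURL or PHP: the function name `split_redirect_chain` and the sample response are made up, and it assumes the common case where the redirect responses themselves carry no body and the final body does not begin with "HTTP/". The same idea applies to the string returned by curl_exec().

```python
def split_redirect_chain(raw):
    """Return (header_blocks, body) for a raw response in which the
    headers of every hop are prepended to the final body."""
    header_blocks = []
    rest = raw
    # Every hop begins with a status line such as "HTTP/1.1 302 Found".
    while rest.startswith("HTTP/"):
        # A blank line (CRLF CRLF) terminates each header block.
        block, _, rest = rest.partition("\r\n\r\n")
        header_blocks.append(block)
    return header_blocks, rest


# Made-up sample: one redirect followed by the final page.
sample = (
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.domain.net/tour/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>landing page</html>"
)

blocks, body = split_redirect_chain(sample)
final_headers = blocks[-1]  # headers of the last response only
```

Here `final_headers` holds only the headers of the last response and `body` holds only the final page. If a redirect in the chain does return a body, this simple loop stops at that body, so that case would need extra handling.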
Upvotes: 1
Reputation: 340
You're looking at three different responses, one per request, each of which has its own set of headers. The first URL redirects to the second and the second redirects to a third, so your browser has to make three requests to get the final content of the landing page. The last response, the 200 OK, is the real landing page, and its address is the one in the second Location header. Why do they do this? Mainly out of disregard for the extra latency this adds to the user experience. Judging by the URLs, this is for some kind of user tracking or statistics purpose, and it's likely easier for them to bounce the browser around their site than it is to return the content directly.
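To make the chain explicit, the header dump can be walked hop by hop. The following is a minimal sketch in Python; the helper names `parse_hops` and `landing_url` are hypothetical, the header text is a shortened copy of the dump in the question, and it assumes every Location value is an absolute URL (a relative one would have to be resolved against the previous URL).

```python
def parse_hops(header_text):
    """Return one dict per response, with its status code and Location."""
    hops = []
    for line in header_text.splitlines():
        if line.startswith("HTTP/"):
            # A status line such as "HTTP/1.1 302 Found" starts a new hop.
            parts = line.split(None, 2)
            valid = len(parts) > 1 and parts[1].isdigit()
            hops.append({"status": int(parts[1]) if valid else None,
                         "location": None})
        elif hops and line.lower().startswith("location:"):
            # Header names are case-insensitive; keep the value as sent.
            hops[-1]["location"] = line.split(":", 1)[1].strip()
    return hops


def landing_url(hops):
    """The last Location in the chain is the address of the final page."""
    locations = [hop["location"] for hop in hops if hop["location"]]
    return locations[-1] if locations else None


# Shortened copy of the headers from the question.
headers = """HTTP/1.1 302 Moved Temporarily
Location: http://secure.domain.net/track/NDg6MTE6MTU/?autocamp=TJ_ABC_VA_A02
HTTP/1.1 302 Found
Location: http://www.domain.net/tour/?nats=MjYwNjk6MTE6MTU,0,0,0,0&autocamp=TJ_ABC_VA_A02
HTTP/1.1 200 OK
Content-Type: text/html
"""

hops = parse_hops(headers)
print([hop["status"] for hop in hops])
print(landing_url(hops))
```

Each 302 contributes one Location, and the 200 contributes none, which is why two Location headers appear for three responses.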
Upvotes: 0