Reputation: 12808

Webpage detecting / displaying different content for curl request - Why?

I need to retrieve and parse the text of public domain books, such as those found on gutenberg.org, with PHP.

To retrieve the content of most webpages I am able to use CURL requests to retrieve the HTML exactly as I would find had I navigated to the URL in a browser.

Unfortunately on some pages, most importantly gutenberg.org pages, the websites display different content or send a redirect header.

For example, when attempting to load this target gutenberg.org page, a curl request gets redirected to a different but logically related gutenberg.org page. I can successfully visit the target page in my browser with both cookies and JavaScript turned off.

Why is the curl request being redirected while a regular browser request to the same site is not?

Here is the code I use to retrieve the webpage:

$urlToScan = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";

if(!isset($userAgent)){
  $userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";
}

$ch = curl_init();
$timeout = 15;
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
#curl_setopt($ch, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_URL, $urlToScan);
$html = curl_exec($ch);
curl_close($ch);

if($html == null){
    return false;  
} 
print $html;

Upvotes: 0

Answers (2)

Ulad Kasach

Reputation: 12808

The target page could be loaded in a browser without cookies or JavaScript, but not with curl, because the website checks the Referer header of the request. The page can be loaded without cookies by setting the appropriate Referer header:

curl_setopt($ch, CURLOPT_REFERER, "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger");

As pointed out by @madshvero, the page can also, surprisingly, be loaded by simply omitting the user agent.
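
Putting the two observations together, a minimal sketch might look like the following. The helper names `buildRefererOptions` and `fetchWithReferer` are hypothetical (they are not part of cURL or of the question's code), and the sketch assumes the site keeps accepting this Referer value. Passing no user agent leaves `CURLOPT_USERAGENT` unset, which corresponds to the second workaround.

```php
<?php
// Hypothetical helper: builds the cURL options for a request that sends a Referer header.
function buildRefererOptions(string $url, string $referer, ?string $userAgent = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_REFERER        => $referer, // the header the site appears to check
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => 15,
    ];
    // Only send a User-Agent header when one is given explicitly.
    if ($userAgent !== null) {
        $options[CURLOPT_USERAGENT] = $userAgent;
    }
    return $options;
}

// Hypothetical helper: returns the response body, or false if the request fails.
function fetchWithReferer(string $url, string $referer, ?string $userAgent = null)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildRefererOptions($url, $referer, $userAgent));
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

It could be called as `fetchWithReferer("http://www.gutenberg.org/cache/epub/34175/pg34175.txt", "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger")`, checking the result against `false` before parsing it.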

Upvotes: 2

Nanne

Reputation: 64399

The hint is probably in the URL: it says "welcome stranger". They are redirecting every first-time visitor to this page. Once you have visited the page, they will not redirect you anymore.

They don't seem to be saving a lot of stuff in your browser, but they do set a cookie with a session id. This is the most logical thing really: check if there is a session.

What you need to do is connect with curl AND a cookie. You can use your browser's cookie for this, but in case it expires, you'd be better off doing the following:

1. Request the page.
2. If the page is redirected, save the cookie (you now have a session).
3. Request the page again with that cookie.

If all goes well, the second request will not redirect, at least until the cookie / session expires, at which point you start again. See the manual for how to work with cookies and cookie jars.
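
The request, save, and retry steps above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`buildCookieJarOptions`, `isRedirect`, `fetchWithSession`) and the cookie jar path are hypothetical, and it assumes the site really does skip the redirect once a session cookie is present. Redirects are deliberately not followed so that the first response can be inspected.

```php
<?php
// Hypothetical helper: options that read and write cookies through one jar file.
function buildCookieJarOptions(string $url, string $cookieJar): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_COOKIEFILE     => $cookieJar, // enables the cookie engine and loads saved cookies
        CURLOPT_COOKIEJAR      => $cookieJar, // cookies are written here when the handle is released
        CURLOPT_FOLLOWLOCATION => false,      // keep the redirect visible so it can be detected
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 15,
    ];
}

// True for 3xx status codes.
function isRedirect(int $status): bool
{
    return $status >= 300 && $status < 400;
}

// Hypothetical helper: returns the body of the last response, or false on failure.
function fetchWithSession(string $url, string $cookieJar)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCookieJarOptions($url, $cookieJar));

    // First request: may be answered with a redirect plus a session cookie.
    $body = curl_exec($ch);

    if ($body !== false && isRedirect((int) curl_getinfo($ch, CURLINFO_HTTP_CODE))) {
        // Second request on the same handle: cookies set by the first
        // response are held by the handle and sent automatically.
        $body = curl_exec($ch);
    }
    return $body;
}
```

Reusing one handle keeps the session cookie in memory between the two requests, while the jar file lets later runs start with the saved session. If the second response is still a redirect, this sketch simply returns its body, so the caller may want to check the status code as well.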

Upvotes: 2
