Fireflight

Reputation: 3011

PHP Web scraping tutorial is failing

I'm trying to wrap my mind around some PHP web scraping using cURL. I recently picked up a short book on the topic, but am stuck on one of the tutorials and can't seem to find where the error is. The cookie.txt file is created, so I know that at least one portion of the function is executing properly.

I've tried using both the id and the name attributes of the name and password input fields, without any luck. As far as I can tell, I'm also using the correct POST URL.

<?php 

// Function to submit form using cURL POST method
function curlPost($postUrl, $postFields, $successString) {

  $useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'; // Setting the user agent of a very old, yet popular browser.

  $cookie = 'cookie.txt'; //Setting a cookie file to store cookie

  $ch = curl_init(); // Initializing cURL session

  // Setting cURL options
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Prevent cURL from verifying SSL certificate
  curl_setopt($ch, CURLOPT_FAILONERROR, TRUE); // Fail silently on HTTP status codes >= 400
  curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE); // Start a new cookie session, ignoring session cookies from earlier runs
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow Location: headers
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Returning transfer as a string
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // Setting cookiefile
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie); // Setting cookiejar
  curl_setopt($ch, CURLOPT_USERAGENT, $useragent); // Setting useragent
  curl_setopt($ch, CURLOPT_URL, $postUrl); // Setting URL to POST

  curl_setopt($ch, CURLOPT_POST, TRUE); // Setting method as POST
  curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields)); // Setting POST fields as a URL-encoded string built from the array

  $results = curl_exec($ch); // Executing cURL session
  curl_close($ch); // Closing cURL session

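  // Checking if login was successful by checking existence of string
  // strpos() returns 0 for a match at position 0, so compare against FALSE explicitly
  if ($results !== FALSE && strpos($results, $successString) !== FALSE) {
    return $results;
  } else {
    return FALSE;
  }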
}

$userEmail = '[email protected]'; // Setting your email address for site login
$userPass = 'password'; // Setting your password for site login

$postUrl = 'https://www.packtpub.com/'; // Setting URL to POST to

// Setting form input fields as 'name' => 'value'
$postFields = array (
'name' => $userEmail,
'password' => $userPass,
'form_id' => 'packt-login-form-header'
);

$successString = 'You are logged in as';

$loggedIn = curlPost($postUrl, $postFields, $successString); // Executing curlPost login and storing results page in $loggedIn

?>

Upvotes: 0

Views: 267

Answers (1)

theafh

Reputation: 478

I've tested the script under Linux and it works as expected, with two minor corrections:

First, as hindmost mentioned, the path to the cookie file has to be absolute. You can either provide the full path or use something like this:

$cookie = dirname(__FILE__).'/cookie.txt';

OR

$cookie = __DIR__.'/cookie.txt'; // if PHP version >= 5.3.0

Either form resolves at runtime to the directory of the file in which the function is declared.
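
As a rough sketch of the same idea, the absolute path can be built in one place and the directory checked for write access before cURL is asked to store cookies there. The helper name cookiePath() is hypothetical and not part of the original script:

```php
<?php
// Hypothetical helper (not in the original script): returns an absolute
// path to the cookie file, located next to the file that declares this function.
function cookiePath($fileName = 'cookie.txt') {
    return __DIR__ . DIRECTORY_SEPARATOR . $fileName;
}

$cookie = cookiePath();

// The cookie jar can only be written if this directory is writable.
if (!is_writable(dirname($cookie))) {
    echo 'Cookie directory is not writable: ' . dirname($cookie) . PHP_EOL;
}
```

Inside curlPost(), $cookie would then be passed to CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR exactly as before.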

Second, you have to do “something” with the content of the $loggedIn variable to see any effect and to debug further. For example, you could add this at the end of your script:

var_dump($loggedIn);

This prints “bool(false)” on error (the request failed or the success string was not found); otherwise it prints the response body, i.e. the content of the $results variable inside the function.
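
Because curlPost() returns FALSE both when the request fails and when the success string is missing, a strict comparison against FALSE is a safer check than a loose truthiness test. A minimal sketch, assuming a hypothetical helper describeLoginResult() that is not part of the original script:

```php
<?php
// Hypothetical helper (not in the original script): turns the return value
// of curlPost() into a short, human-readable status line.
function describeLoginResult($loggedIn) {
    // Strict comparison: only a real boolean FALSE counts as failure.
    if ($loggedIn === false) {
        return 'Login failed: request error or success string not found';
    }
    return 'Login succeeded: received ' . strlen($loggedIn) . ' bytes';
}

// Sample values stand in for the result of curlPost().
echo describeLoginResult(false), PHP_EOL;
echo describeLoginResult('<html>You are logged in as ...</html>'), PHP_EOL;
```

In the real script, the call would be describeLoginResult($loggedIn), placed after the curlPost() call.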

Upvotes: 1
