Reputation: 1667
I am trying to scrape http://www.car4you.at/Haendlersuche. It shows 20 results at first, plus pagination. I scrape those first 20 links successfully, but I am having trouble getting the link to the next page, because the pagination href contains no URL, only a JavaScript function:
href="javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')"
My question is: how can I load the page with cURL, "click" the next-page button, wait for the response, and then parse it?
Here is what I am trying. My cURL function:
function postCurlReq($loginActionUrl, $parameters, $referer)
{
    curl_setopt($this->curl, CURLOPT_URL, $loginActionUrl);
    curl_setopt($this->curl, CURLOPT_POST, 1);
    curl_setopt($this->curl, CURLOPT_POSTFIELDS, $parameters);
    curl_setopt($this->curl, CURLOPT_COOKIEJAR, realpath('cookie.txt')); // cookie.txt should be in the same directory as the calling script
    curl_setopt($this->curl, CURLOPT_COOKIEFILE, realpath('cookie.txt'));
    curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($this->curl, CURLOPT_REFERER, $referer); // set referer
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYPEER, FALSE); // skip SSL certificate verification
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYHOST, 2);
    $result['EXE'] = curl_exec($this->curl);
    $result['INF'] = curl_getinfo($this->curl);
    $result['ERR'] = curl_error($this->curl);
    return $result;
}
And here is the code I tried for the pagination:
$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("href" => "javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')");
$referer = "http://www.car4you.at/Haendlersuche";
$loginHTML = $crawler->postCurlReq($loginUrl, $parameters, $referer);
if (empty($loginHTML['ERR'])) { // if no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}
A second approach would be the results-per-page select list (10, 20, 50). If my script could select 50, that would also do the job. Here is the code I tried for the select list:
$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("value" => "50");
$referer = "http://www.car4you.at/Haendlersuche";
$loginHTML = $crawler->postCurlReq($loginUrl, $parameters, $referer);
if (empty($loginHTML['ERR'])) { // if no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}
Upvotes: 3
Views: 5482
Reputation: 10101
When scraping a site you aren't running a browser, you're just picking up the HTML response from the site. This means that you can't simply run JavaScript code; you'd have to parse it yourself, or perhaps use a library to parse it for you.
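For instance, a minimal sketch of pulling the three arguments out of the javascript: href from your question, so they can be replayed as POST data later (the regex is just one way to do it):

// Extract the AjaxCallback_ResList arguments from the pager href.
$href = "javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')";
if (preg_match("/AjaxCallback_ResList\('([^']+)',\s*'([^']+)',\s*'([^']+)'\)/", $href, $m)) {
    list(, $list, $action, $arg) = $m; // 'ResultList', 'Pager', '1_1874'
}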
However, any AJAX button that fetches more results is just calling another URL (perhaps with GET or POST variables), then parsing the result and sticking it somewhere in the HTML of the page. You can work out which URL calls are being made using the Developer Tools in Chrome, Firebug, etc. Then you can scrape those URLs instead of the original one to extract the information.
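For example, if the Network tab shows the pager POSTing back to the listing URL, you could replay that request with the postCurlReq() function from your question. Note that every field name below is a placeholder, not the site's real one; copy the actual names and values from the recorded request:

// Sketch only: replace these placeholder keys/values with the exact fields
// shown for the pager request in DevTools' Network tab.
$ajaxUrl = "http://www.car4you.at/Haendlersuche"; // or the AJAX endpoint DevTools shows
$parameters = array(
    '__EVENTTARGET'   => 'ResultList',   // placeholder
    '__EVENTARGUMENT' => 'Pager|1_1874', // placeholder
);
$referer = "http://www.car4you.at/Haendlersuche";
$page2 = $crawler->postCurlReq($ajaxUrl, $parameters, $referer);
if (empty($page2['ERR'])) {
    // The response is typically an HTML fragment rather than a full page,
    // so parse it for the next batch of dealer links.
    print_r($page2['EXE']);
}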
In this particular case it is quite tricky because there are a number of POST variables on the AJAX request, and spotting the pattern isn't trivial, but it is possible, and probably easier than trying to emulate the JavaScript.
In general, if you really want to simulate running JavaScript while scraping, it is possible to run a browser and interact with it programmatically. This is what Selenium does, and I suspect something like this could be done fairly painlessly with it. It's probably still easier to sniff the AJAX request, though.
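For completeness, a sketch of the browser route using the php-webdriver bindings. It assumes a Selenium server is running on localhost:4444, and the ResultList id and the link selector are guesses taken from the JavaScript call in your question:

require 'vendor/autoload.php'; // php-webdriver, installed via Composer

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
$driver->get('http://www.car4you.at/Haendlersuche');

// Remember the current list node, then "click" the pager link.
$list = $driver->findElement(WebDriverBy::id('ResultList')); // assumed container id
$driver->findElement(WebDriverBy::cssSelector("a[href*='AjaxCallback_ResList']"))->click();

// Wait for the AJAX update; this assumes the old node gets replaced.
// Adjust the condition to whatever actually changes on the page.
$driver->wait(10)->until(WebDriverExpectedCondition::stalenessOf($list));

$html = $driver->getPageSource(); // parse this just like the cURL response
$driver->quit();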
Upvotes: 2