Reputation: 1667
I am trying to scrape http://www.car4you.at/Haendlersuche. It shows 20 results at first, plus pagination. I scrape those first 20 links successfully, but I am having trouble getting the link to the next page, because the pagination href contains no URL, only a JavaScript function:
href="javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')"
My question is: how can I load the page with cURL, "click" the next-page button, wait for the response, and then parse it?
Here is what I am trying. My cURL function:
function postCurlReq($loginActionUrl, $parameters, $referer)
{
    curl_setopt($this->curl, CURLOPT_URL, $loginActionUrl);
    curl_setopt($this->curl, CURLOPT_POST, 1);
    curl_setopt($this->curl, CURLOPT_POSTFIELDS, $parameters);
    curl_setopt($this->curl, CURLOPT_COOKIEJAR, realpath('cookie.txt')); // cookie.txt should be in the same directory as the calling script
    curl_setopt($this->curl, CURLOPT_COOKIEFILE, realpath('cookie.txt'));
    curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($this->curl, CURLOPT_REFERER, $referer); // set referer
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYPEER, FALSE); // skip SSL certificate verification
    curl_setopt($this->curl, CURLOPT_SSL_VERIFYHOST, 2);
    $result['EXE'] = curl_exec($this->curl);
    $result['INF'] = curl_getinfo($this->curl);
    $result['ERR'] = curl_error($this->curl);
    return $result;
}
And here is the code I tried for the pagination:
$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("href" => "javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')");
$referer = "http://www.car4you.at/Haendlersuche";
$loginHTML = $crawler->postCurlReq($loginUrl, $parameters, $referer);
if (empty($loginHTML['ERR'])) { // if no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}
A second approach would be the results-per-page select list (10, 20, 50). If my script could select 50, that would also do the job. Here is the code I tried for the select list:
$loginUrl = "http://www.car4you.at/Haendlersuche";
$parameters = array("value" => "50");
$referer = "http://www.car4you.at/Haendlersuche";
$loginHTML = $crawler->postCurlReq($loginUrl, $parameters, $referer);
if (empty($loginHTML['ERR'])) { // if no error occurred while opening the URL
    print_r($loginHTML['EXE']);
}
Upvotes: 3
Views: 5482
Reputation: 10101
When scraping a site you aren't running a browser, you're just picking up the HTML response from the site. This means that you can't simply run JavaScript code; you'd have to parse it yourself, or perhaps use a library to parse it for you.
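For instance, a minimal sketch of pulling the three arguments out of the javascript: href from your question, so they can be replayed as POST data later (the regex is just one way to do it):

// Extract the AjaxCallback_ResList arguments from the pager href.
$href = "javascript:AjaxCallback_ResList('ResultList', 'Pager', '1_1874')";
if (preg_match("/AjaxCallback_ResList\('([^']+)',\s*'([^']+)',\s*'([^']+)'\)/", $href, $m)) {
    list(, $list, $action, $arg) = $m; // 'ResultList', 'Pager', '1_1874'
}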
However, any AJAX button that fetches more results is just calling another URL (perhaps with GET or POST variables), then parsing the result and sticking it somewhere in the HTML of the page. You can work out which URL calls are being made using the Developer Tools in Chrome, Firebug, etc. Then you can scrape those URLs instead of the original one to extract the information.
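For example, if the Network tab shows the pager POSTing back to the listing URL, you could replay that request with the postCurlReq() function from your question. Note that every field name below is a placeholder, not the site's real one; copy the actual names and values from the recorded request:

// Sketch only: replace these placeholder keys/values with the exact fields
// shown for the pager request in DevTools' Network tab.
$ajaxUrl = "http://www.car4you.at/Haendlersuche"; // or the AJAX endpoint DevTools shows
$parameters = array(
    '__EVENTTARGET'   => 'ResultList',   // placeholder
    '__EVENTARGUMENT' => 'Pager|1_1874', // placeholder
);
$referer = "http://www.car4you.at/Haendlersuche";
$page2 = $crawler->postCurlReq($ajaxUrl, $parameters, $referer);
if (empty($page2['ERR'])) {
    // The response is typically an HTML fragment rather than a full page,
    // so parse it for the next batch of dealer links.
    print_r($page2['EXE']);
}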
In this particular case it is quite tricky because there are a number of POST variables on the AJAX request, and spotting the pattern isn't trivial, but it is possible, and probably easier than trying to emulate the JavaScript.
In general, if you really want to simulate running JavaScript while scraping, it is possible to run a browser and interact with it programmatically. This is what Selenium does, and I suspect something like this could be done fairly painlessly with it. It's probably still easier to sniff the AJAX request, though.
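For completeness, a sketch of the browser route using the php-webdriver bindings. It assumes a Selenium server is running on localhost:4444, and the ResultList id and the link selector are guesses taken from the JavaScript call in your question:

require 'vendor/autoload.php'; // php-webdriver, installed via Composer

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
$driver->get('http://www.car4you.at/Haendlersuche');

// Remember the current list node, then "click" the pager link.
$list = $driver->findElement(WebDriverBy::id('ResultList')); // assumed container id
$driver->findElement(WebDriverBy::cssSelector("a[href*='AjaxCallback_ResList']"))->click();

// Wait for the AJAX update; this assumes the old node gets replaced.
// Adjust the condition to whatever actually changes on the page.
$driver->wait(10)->until(WebDriverExpectedCondition::stalenessOf($list));

$html = $driver->getPageSource(); // parse this just like the cURL response
$driver->quit();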
Upvotes: 2