Reputation: 579
I am trying to scrape data with PHP, but the URL I need to access requires POST data.
<?php
//set POST variables
$url = 'https://www.ncaa.org/';
//$url = 'https://web3.ncaa.org/hsportal/exec/hsAction?hsActionSubmit=searchHighSchool';
// This is the data to POST to the form. The KEY of the array is the name of the field. The value is the value posted.
$data_to_post = array();
$data_to_post['hsCode'] = '332680';
$data_to_post['state'] = '';
$data_to_post['city'] = '';
$data_to_post['name'] = '';
$data_to_post['hsActionSubmit'] = 'Search';
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// This sets the number of fields to post
curl_setopt($curl,CURLOPT_POST, sizeof($data_to_post));
// This is the fields to post in the form of an array.
curl_setopt($curl,CURLOPT_POSTFIELDS, $data_to_post);
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
?>
When I try the second $url, where the actual information is hosted, the request comes back with "failed to load response data", but the same script can access the NCAA home page. Why do I get "failed to load response data" even though I am sending the correct form data?
Upvotes: 0
Views: 858
Reputation: 79
For HTTPS connections, cURL needs one special option turned off: CURLOPT_SSL_VERIFYPEER.
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// This option MUST BE FALSE
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// Send the request as a POST (CURLOPT_POST is a boolean flag, not a field count)
curl_setopt($curl, CURLOPT_POST, true);
// This is the fields to post in the form of an array.
curl_setopt($curl,CURLOPT_POSTFIELDS, $data_to_post);
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
Upvotes: 0
Reputation: 781058
The site apparently checks for a recognized user agent. By default PHP curl doesn't send a User-Agent header. Add
curl_setopt($curl, CURLOPT_USERAGENT, 'curl/7.21.4');
and the script returns a response. However, in this case, the response says that it requires a newer browser than the one you have. So you should copy the user agent string from a real browser, e.g.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36');
Also, it requires the parameters to be sent in application/x-www-form-urlencoded format. When you use an array as the argument to CURLOPT_POSTFIELDS, it uses multipart/form-data. So change that line to:
curl_setopt($curl,CURLOPT_POSTFIELDS, http_build_query($data_to_post));
to convert the array to a URL-encoded string.
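To see what that conversion produces, here is a minimal sketch using the same field values as the question. It only builds the string and makes no request.

```php
<?php
// Same fields as in the question.
$data_to_post = array(
    'hsCode'         => '332680',
    'state'          => '',
    'city'           => '',
    'name'           => '',
    'hsActionSubmit' => 'Search',
);

// http_build_query() returns a URL-encoded string. Passing a string
// (rather than an array) to CURLOPT_POSTFIELDS makes cURL send the body
// as application/x-www-form-urlencoded.
$encoded = http_build_query($data_to_post);

echo $encoded, "\n";
// hsCode=332680&state=&city=&name=&hsActionSubmit=Search
```

Empty strings are kept as empty parameters (`state=`), so the server still sees every form field.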
And in the URL, leave out ?hsActionSubmit=searchHighSchool, as that parameter is sent in the POST fields.
The final, working script looks like this:
<?php
//set POST variables
//$url = 'https://www.ncaa.org/';
$url = 'https://web3.ncaa.org/hsportal/exec/hsAction';
// This is the data to POST to the form. The KEY of the array is the name of the field. The value is the value posted.
$data_to_post = array();
$data_to_post['hsCode'] = '332680';
$data_to_post['state'] = '';
$data_to_post['city'] = '';
$data_to_post['name'] = '';
$data_to_post['hsActionSubmit'] = 'Search';
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// Send the request as a POST (CURLOPT_POST is a boolean flag, not a field count)
curl_setopt($curl, CURLOPT_POST, true);
// This is the fields to post in the form of an array.
curl_setopt($curl,CURLOPT_POSTFIELDS, http_build_query($data_to_post));
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36');
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
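If a request like this still fails, it can help to capture the response in a variable and ask cURL what went wrong. The following is only a sketch, not part of the script above: the helper name `post_form` and the shape of its return array are hypothetical, and it assumes the curl extension is loaded.

```php
<?php
// Hypothetical helper: POST URL-encoded form data and report failures.
function post_form($url, array $fields, $userAgent)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_POST, true);
    curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    // Return the body from curl_exec() instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    $body = curl_exec($curl);

    $result = array(
        'body'   => $body,
        // curl_exec() returns false on a transport-level failure.
        'error'  => ($body === false) ? curl_error($curl) : null,
        // HTTP status code of the response, or 0 if none was received.
        'status' => curl_getinfo($curl, CURLINFO_HTTP_CODE),
    );

    curl_close($curl);
    return $result;
}
```

Calling `post_form($url, $data_to_post, $userAgent)` and checking `error` and `status` helps tell a transport problem (DNS, TLS, refused connection) apart from a response the server rejected.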
Upvotes: 1