Reputation: 10348
I am trying to write a script that retrieves the HTML from my school's schedule search page. The page loads normally in a browser, but when I request it with cURL, I get the HTML of the page it redirects to instead. When I changed the CURLOPT_FOLLOWLOCATION option from true to false, the output was only the response headers followed by a blank page.
For reference, my PHP code is:
<?php
$curl_connection = curl_init('https://www.registrar.usf.edu/ssearch/');
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($curl_connection, CURLOPT_HEADER, true);
curl_setopt($curl_connection, CURLOPT_REFERER, "https://www.registrar.usf.edu/");
$result = curl_exec($curl_connection);
print $result;
?>
The page I am trying to fetch with cURL is https://www.registrar.usf.edu/ssearch/ or https://www.registrar.usf.edu/ssearch/search.php
Any ideas?
Upvotes: 0
Views: 1489
Reputation: 1513
I added two lines so that cURL now saves and resends cookies. The site uses a cookie to decide whether to redirect you when you request the schedule page. I also set CURLOPT_FOLLOWLOCATION back to true.
$curl_connection = curl_init();
$url = "https://www.registrar.usf.edu/ssearch/search.php";
curl_setopt($curl_connection, CURLOPT_URL, $url);
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_COOKIEJAR, 'cookie.txt');  // file cURL writes received cookies to
curl_setopt($curl_connection, CURLOPT_COOKIEFILE, 'cookie.txt'); // file cURL reads cookies from for later requests
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_connection, CURLOPT_HEADER, true);
curl_setopt($curl_connection, CURLOPT_REFERER, "https://www.registrar.usf.edu/");
$result = curl_exec($curl_connection);
echo $result;
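Because CURLOPT_HEADER is true, $result above contains the response headers followed by the body. A minimal sketch of separating the two is shown here; split_response() is a hypothetical helper, and the response string is made up so the example runs offline.

```php
<?php
// Hypothetical helper (not part of the code above): split a raw cURL
// response into header block and body, given the header size in bytes.
function split_response(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Offline demonstration with a made-up response string.
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw     = $headers . "<html>schedule</html>";
$parts   = split_response($raw, strlen($headers));

echo $parts['body'], "\n";
```

With a real handle, the header size should be available after curl_exec() from curl_getinfo($curl_connection, CURLINFO_HEADER_SIZE); when redirects are followed, that value covers the headers of every response in the chain.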
Also, I haven't seen anyone pass the URL to curl_init before (it is valid, since the URL argument is optional), so I set it with CURLOPT_URL instead.
Here is the resulting cookie file:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
www.registrar.usf.edu FALSE / FALSE 0 PHPSESSID eied78t0v1qlqcop0rdk214361
www.registrar.usf.edu FALSE /ssearch/ FALSE 1336718465 cookie_test cookie_set
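Each non-comment line of that file holds seven fields: domain, include-subdomains flag, path, secure flag, expiry timestamp, name, and value. As a rough sketch for inspecting what the site set, the hypothetical parse_cookie_line() helper below reads one such line. Real cookie files are tab-separated; this version also accepts spaces, as in the listing above.

```php
<?php
// Hypothetical helper: parse one line of a Netscape-format cookie file.
// Returns null for comments, blank lines, and malformed lines.
function parse_cookie_line(string $line): ?array
{
    $line = trim($line);
    $httpOnly = false;
    // curl marks HttpOnly cookies with a "#HttpOnly_" prefix on the domain.
    if (strpos($line, '#HttpOnly_') === 0) {
        $httpOnly = true;
        $line = substr($line, strlen('#HttpOnly_'));
    }
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    // Split into at most seven fields on any whitespace.
    $fields = preg_split('/\s+/', $line, 7);
    if (count($fields) < 6) {
        return null;
    }
    // An empty cookie value leaves only six fields after trim().
    $fields = array_pad($fields, 7, '');
    [$domain, $subdomains, $path, $secure, $expires, $name, $value] = $fields;

    return [
        'domain'             => $domain,
        'include_subdomains' => $subdomains === 'TRUE',
        'path'               => $path,
        'secure'             => $secure === 'TRUE',
        'expires'            => (int) $expires, // 0 means a session cookie
        'name'               => $name,
        'value'              => $value,
        'http_only'          => $httpOnly,
    ];
}

$cookie = parse_cookie_line('www.registrar.usf.edu FALSE /ssearch/ FALSE 1336718465 cookie_test cookie_set');
echo $cookie['name'], '=', $cookie['value'], "\n";
```

To check a whole jar, the same function could be applied to each line returned by file('cookie.txt'), skipping the null results.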
If you ever need to debug a cURL request that isn't working, start with var_dump(curl_getinfo($curl_connection)); after curl_exec,
and then check curl_error($curl_connection);
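Those two checks can be bundled into one call. The sketch below is only one way to do it: curl_debug_report() is a hypothetical helper, and the demo URL points at a local port that is assumed to be closed, so the request fails on purpose and the error fields are filled in.

```php
<?php
// Hypothetical helper: execute a prepared handle and collect diagnostics.
function curl_debug_report($handle): array
{
    $body = curl_exec($handle);

    return [
        'ok'        => $body !== false,
        'errno'     => curl_errno($handle),
        'error'     => curl_error($handle),
        'http_code' => curl_getinfo($handle, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($handle, CURLINFO_EFFECTIVE_URL),
        'redirects' => curl_getinfo($handle, CURLINFO_REDIRECT_COUNT),
    ];
}

// Demo against a local port assumed to be closed, so the request fails.
$ch = curl_init('http://127.0.0.1:1/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);

$report = curl_debug_report($ch);
var_dump($report);
```

For the schedule page, the 'redirects' and 'final_url' entries would show whether cURL was bounced to another page, while 'error' stays empty unless the transfer itself failed.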
Upvotes: 3