Reputation: 479
I am trying to use cURL and PHP to scrape proxies from a webpage. However, when I use cURL, all I get in $content is the CSS. The page runs on WordPress, so it loads content dynamically, but I haven't found anything that helps me download the dynamic content. When I use wget on Linux, the page downloads fine.
//$source1 = file_get_contents('');
$source1 = get_data("");
$array = array();
$source1 = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $array);
//download webpage
function get_data($url) {
$options = array(
CURLOPT_RETURNTRANSFER => 1, // return web page
CURLOPT_HEADER => true, // include response headers in the returned string
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20080311 Firefox/", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 50, // stop after 50 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
My output:
string(203221) HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Expires: Wed, 06 Feb 2013 22:09:23 GMT
Date: Wed, 06 Feb 2013 22:09:23 GMT
Cache-Control: private, max-age=0
Last-Modified: Wed, 06 Feb 2013 20:39:30 GMT
ETag: "c6675d47-80ec-48ee-9c0f-613c9419f172"
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 47132
Server: GSE
Upvotes: 0
Views: 5110
Reputation: 2887
It seems to me like either a timeout or a problem with your regex.
Why not stick to file_get_contents like you tried in the first place?
$content = file_get_contents('');
preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);
Printing $matches[0] (for example with print_r) then gives a list of IPs:
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
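To try the pattern without hitting the network, here is a minimal sketch that runs the same preg_match_all call on a made-up HTML fragment. The sample string and its addresses (taken from the documentation IP ranges) are assumptions for illustration, not data from the real page.

```php
<?php
// Hypothetical sample standing in for the downloaded page.
$content = '<td>192.0.2.10:8080</td><td>198.51.100.7:3128</td><td>203.0.113.5</td>';

// Same pattern as in the answer: a dotted quad with an optional ":port" suffix.
preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);

// $matches[0] holds the full matches; $matches[2] holds only the ":port" parts.
print_r($matches[0]);
```

Because the port group is optional, bare addresses without a port are matched too; drop the trailing ? if only ip:port pairs are wanted.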
Hope that helps.
Upvotes: 1
Reputation: 3273
A couple of things:
Your preg_grep statement is incorrect. You're searching an empty array and saving the result to the variable you just pulled your data into. Try:
$array = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $source1);
When I run the code and dump $source1['content'] directly after the get_data call, I get a crap-ton of IP addresses ...
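One thing to keep in mind is that preg_grep filters array elements and returns each matching element whole, so it helps to decide what you feed it. The sketch below is only an illustration: the $source1 array is a hypothetical stand-in for what get_data returns, with made-up addresses. It shows preg_grep on the page split into lines, next to preg_match_all on the 'content' string when only the ip:port substrings are wanted.

```php
<?php
// Hypothetical stand-in for the array returned by get_data().
$source1 = [
    'errno'   => 0,
    'content' => "<li>192.0.2.10:8080</li>\n<li>no proxy here</li>\n<li>198.51.100.7:3128</li>",
];

$pattern = '/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}\b/';

// preg_grep() keeps the array elements that match, so split the page into lines first.
// The original keys are preserved in the result.
$lines = preg_grep($pattern, explode("\n", $source1['content']));

// preg_match_all() extracts just the ip:port substrings from the whole string.
preg_match_all($pattern, $source1['content'], $matches);
$proxies = $matches[0];

print_r($lines);
print_r($proxies);
```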
Upvotes: 2
Reputation: 19563
cURL won't be able to get it directly, since it won't execute JavaScript. But if the content comes from an AJAX request, you can make a request to that endpoint directly.
Use your browser's dev tools or Firebug to see which requests the page makes.
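Once the network tab shows which request returns the data, calling it could look roughly like the sketch below. The function names ajax_options and fetch_ajax and the example.com URLs are hypothetical placeholders, and the X-Requested-With header is only an assumption about what the endpoint might check; copy the real URL and headers from the request you observe.

```php
<?php
// Build cURL options for calling an AJAX endpoint directly.
// $referer is the page that normally issues the request (placeholder here).
function ajax_options(string $referer): array {
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_REFERER        => $referer,
        // Assumption: some endpoints look for this header on AJAX calls.
        CURLOPT_HTTPHEADER     => ['X-Requested-With: XMLHttpRequest'],
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Fetch the endpoint and return its raw body, or throw on a transport error.
function fetch_ajax(string $endpoint, string $referer): string {
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, ajax_options($referer));
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException("Request failed: $error");
    }
    return $body;
}

// Usage (placeholder URLs, replace with the request seen in the network tab):
// $data = fetch_ajax('https://example.com/ajax-endpoint', 'https://example.com/');
```

The response is often JSON or an HTML fragment, so the same regex can be run on it, or json_decode can be used if the endpoint returns JSON.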
Upvotes: 3