NewbieUser

Reputation: 173

PHP - The fastest way to get the content of another website and parse it

I have to get a few parameters of users from a website. I can do it because every user has a unique ID and I can look users up by URL:

http://page.com/search_user.php?uid=X

So I put this URL in a for() loop and tried to get 500 results:

<?php

$start = time();
$results = array();

for($i=0; $i<= 500; $i++)
{
    // Fetch the profile page for this user ID.
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, 'http://page.com/search_user.php?uid='.$i);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.2) Gecko/20090729 desktopsmiley_2_2_5643778701369665_44_71 DS_gamingharbor Firefox/3.5.2 (.NET CLR 3.5.30729)');
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    $p = curl_exec($c);
    curl_close($c);

    // Skip IDs whose page has no username on it.
    if ( preg_match('"<span class=\"uname\">(.*?)</span>"si', $p, $matches) )
    {
        $username = $matches[1];
    }
    else
    {
        continue;
    }

    // The comment counts live inside the first <table> on the page.
    preg_match('"<table cellspacing=\"0\">(.*?)</table>"si', $p, $matches);
    $comments = $matches[1];

    // Positive comments.
    preg_match('"<tr class=\"pos\">(.*?)</tr>"si', $comments, $matches_pos);
    preg_match_all('"<td>([0-9]+)</td>"si', $matches_pos[1], $matches);
    $comments_pos = $matches[1][2];

    // Neutral comments.
    preg_match('"<tr class=\"neu\">(.*?)</tr>"si', $comments, $matches_neu);
    preg_match_all('"<td>([0-9]+)</td>"si', $matches_neu[1], $matches);
    $comments_neu = $matches[1][2];

    // Negative comments.
    preg_match('"<tr class=\"neg\">(.*?)</tr>"si', $comments, $matches_neg);
    preg_match_all('"<td>([0-9]+)</td>"si', $matches_neg[1], $matches);
    $comments_neg = $matches[1][2];

    $comments_all = $comments_pos+$comments_neu+$comments_neg;

    // Flag whether the profile has an "O mnie" ("About me") section.
    $about_me = 0;
    if ( preg_match('"<span>O mnie</span>"si', $p) )
    {
        $about_me = 1;
    }

    $results[] = array('comments' => $comments_all, 'about_me' => $about_me, 'username' => $username);
}

echo 'Generated in: <b>'.(time()-$start).'</b> seconds.<br><br>';
var_dump($results);
?>

Finally I got the results: everything was generated in 135 seconds.

Then I replaced curl with file_get_contents() and got 155 seconds.

Is there a faster way to get these results than curl? I have to get 20,000,000 results from the other page, and 135 seconds is too much for me.

Thanks.

Upvotes: 3

Views: 1202

Answers (2)

Cups

Reputation: 6896

Take a look at a previous answer of mine regarding how to divide and conquer this kind of job.

debugging long running PHP script

In your case I'd follow the same idea, but further chunk the requests into groups of, say, 500.
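
A rough sketch of that chunking idea, assuming the script is re-invoked with a ?chunk=N parameter (the parameter name, chunk size, and persistence step are illustrative choices, not something from the linked answer):

<?php
// Process one chunk of user IDs per invocation, so no single run
// has to stay within the whole job's time limit.
$chunkSize = 500;
$chunk     = isset($_GET['chunk']) ? (int) $_GET['chunk'] : 0;
$firstUid  = $chunk * $chunkSize;
$lastUid   = $firstUid + $chunkSize - 1;

for ($uid = $firstUid; $uid <= $lastUid; $uid++) {
    // Fetch and parse http://page.com/search_user.php?uid=$uid exactly as in
    // the question, then persist the partial results (file, database, ...)
    // so the next chunk can be kicked off with a fresh time limit.
}
?>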

Upvotes: 0

LihO

Reputation: 42083

If you really need to query different URLs 500 times, maybe you should consider an asynchronous approach. The problem with the code above is that the slowest part (the bottleneck) is the curl requests themselves. While waiting for a response, your code is doing nothing.

Have a look at PHP asynchronous cURL with callback (i.e. you would make 500 requests "almost at once" and process the responses as they come in, asynchronously).
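
A minimal sketch of that idea using PHP's built-in curl_multi functions directly (rather than the callback wrapper from the linked question); the URL is the one from the question, and the parsing and error handling are left out:

<?php
// Queue all requests on one multi handle.
$mh = curl_multi_init();
$handles = array();

for ($i = 0; $i <= 500; $i++) {
    $c = curl_init('http://page.com/search_user.php?uid=' . $i);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $c);
    $handles[$i] = $c;
}

// Drive all transfers in parallel until every one has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

// Collect the responses and clean up.
foreach ($handles as $uid => $c) {
    $html = curl_multi_getcontent($c);
    // ... parse $html with the same regexes as in the original loop ...
    curl_multi_remove_handle($mh, $c);
    curl_close($c);
}
curl_multi_close($mh);
?>

In practice you would probably not add all 500 handles at once but keep a smaller number of concurrent transfers in flight, to avoid hammering the remote server.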

Upvotes: 2
