Sam

Reputation: 15506

How to limit the number of connections per second made to a server with GET in PHP5?

The script below fetches the translation of words into another language, making a connection to the server for each request.

However, there are so many separate string entities that some of them come back as empty values. Fellow StackOverflow user @Pekka correctly attributed this to a limitation on Google's side: the results are timing out.

Q1. How can I make the connection more robust/reliable, albeit at the cost of speed?
Q2. How can I deliberately limit the number of connections made per second to the server?

I am willing to sacrifice speed (even if it causes a 120-second delay) as long as the returned values are correct. Right now everything starts and finishes in about 0.5 seconds, with various gaps in the translation. It's almost like Dutch cheese (with holes), and I want cheese without holes, even if that means longer waiting times.

As you can see, my own solution of putting the script to sleep for a quarter of a second can hardly be called elegant... How should I proceed from here?

$url = 'http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=' . rawurlencode($string) . '&langpair=' . rawurlencode($from . '|' . $to);

// Perform the GET request with a Referer header attached.
$response = file_get_contents($url, false, stream_context_create(
    array(
        'http' => array(
            'method' => 'GET',
            'header' => "Referer: http://test.com/\r\n"
        )
    )
));

usleep(250000); // deliberate pause of 1/4 of a second
return self::cleanText($response);
}

Upvotes: 2

Views: 2158

Answers (3)

Alix Axel

Reputation: 154553

Here is a snippet of the code I use in my cURL wrapper. The delay increases exponentially, which is a good thing: otherwise you might end up just stressing the server and never getting a positive response:

function CURL($url)
{
    $result = false;

    if (extension_loaded('curl'))
    {
        $curl = curl_init($url);

        if (is_resource($curl))
        {
            curl_setopt($curl, CURLOPT_FAILONERROR, true);
            curl_setopt($curl, CURLOPT_AUTOREFERER, true);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

            for ($i = 1; $i <= 8; ++$i)
            {
                $result = curl_exec($curl);

                if (($i == 8) || ($result !== false))
                {
                    break;
                }

                // Exponential back-off: sleep 0.5 s after the 1st failed attempt,
                // then 1, 2, 4, 8, 16 and 32 seconds after subsequent failures.
                usleep(pow(2, $i - 2) * 1000000);
            }

            curl_close($curl);
        }
    }

    return $result;
}

The $i variable here has a max value of 8, which means the function will try to fetch the URL up to 8 times in total, sleeping for 0.5, 1, 2, 4, 8, 16 and 32 seconds between failed attempts (63.5 seconds of delay overall).

As for concurrent processes, I recommend setting a shared memory variable with APC or similar.
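
A rough sketch of that APC idea (the key name, TTL and wait loop below are my own illustration, not part of the snippet above): apc_add() only succeeds when the key does not exist yet, so a key with a 1-second TTL acts as a crude "one request per second" gate shared by every PHP process on the machine.

function waitForSlot($key = 'translate_throttle', $maxWaitSeconds = 30)
{
    $waited = 0.0;

    // apc_add() is atomic and refuses to overwrite an existing key, so only
    // one process can create the key until its 1-second TTL expires.
    while (!apc_add($key, 1, 1))
    {
        usleep(100000); // wait 0.1 s before trying again
        $waited += 0.1;

        if ($waited > $maxWaitSeconds)
        {
            return false; // gave up waiting for a free slot
        }
    }

    return true;
}

// Usage: acquire a slot before each call to the CURL() wrapper above.
if (waitForSlot())
{
    $response = CURL($url);
}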

Hope it helps!

Upvotes: 1

Charles

Reputation: 51411

How can I deliberately limit the number of connections made per second to the server?

It depends. In an ideal world, if you're expecting any level of traffic whatsoever, you'd probably want your scraper to be a daemon that you communicate with through a message or work queue. In that case, the daemon would be able to keep tight control of the requests per second and throttle things appropriately.
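
A minimal sketch of that daemon shape (the Redis queue, the key names and the fetchTranslation() helper are my own assumptions for illustration, not something prescribed here): because the daemon owns the only connection to the remote service, it can enforce a fixed requests-per-second budget in exactly one place.

// daemon.php -- a long-running worker, started from the CLI.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

while (true) {
    // Block for up to 5 seconds waiting for a job (a string to translate).
    $job = $redis->blPop(array('translate:jobs'), 5);

    if (!empty($job)) {
        // blPop() returns array(queueName, value).
        $result = fetchTranslation($job[1]); // hypothetical wrapper around the GET code above
        $redis->lPush('translate:results', $result);
    }

    usleep(250000); // hard floor of roughly 4 requests per second; tune to the service's limit
}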

It sounds like you're actually doing this live, on a user request. To be honest, your current sleeping strategy is just fine. Sure, it's "crude", but it's simple and it works. The trouble comes when more than one user makes a request at the same time: the requests are ignorant of one another, and you'll end up with more requests per second than the service will permit.

There are a few strategies here. If the URL never changes, that is, you're only throttling a single service, you basically need a semaphore to coordinate multiple scripts.

Consider using a simple lock file. Or, more precisely, a file lock on a lock file:

// Open our lock file for reading and writing;
// create it if it doesn't exist,
// don't truncate it,
// and don't move the file pointer.
$fh = fopen('./lock.file', 'c+');

foreach($list_of_requests as $request_or_whatever) {
    // At the top of the loop, establish the lock.
    $ok = flock($fh, LOCK_EX);
    if(!$ok) {
        echo "Wow, the lock failed, that shouldn't ever happen.";
        break; // Exit the loop.
    }

    // Insert the actual request *and* sleep code here.
    $foo->getTranslation(...);

    // Once the request is made and we've slept, release the lock
    // to allow another process that might be waiting for the lock
    // to grab it and run.
    flock($fh, LOCK_UN);
}

fclose($fh);

This will work well in most cases. If you're on super-low-cost or low-quality shared hosting, locks can backfire because of how the underlying filesystem (doesn't) work. flock is also a bit finicky on Windows.

If you're dealing with multiple services, things get a bit more sticky. My first instinct would be to create a table in the database, keep track of each request made, and add additional throttling if more than X requests have been made to domain Y in the past Z seconds.
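
A rough sketch of that bookkeeping idea, using PDO with SQLite purely for illustration (the table, the column names and the limits are made up; X, Y and Z map to $maxRequests, $domain and $windowSeconds):

$db = new PDO('sqlite:throttle.db');
$db->exec('CREATE TABLE IF NOT EXISTS requests (domain TEXT, made_at INTEGER)');

function throttle(PDO $db, $domain, $maxRequests = 10, $windowSeconds = 60) {
    // Count requests made to this domain inside the current window.
    $stmt = $db->prepare('SELECT COUNT(*) FROM requests WHERE domain = ? AND made_at > ?');
    $stmt->execute(array($domain, time() - $windowSeconds));

    if($stmt->fetchColumn() >= $maxRequests) {
        sleep($windowSeconds); // crude: wait out the window before continuing
    }

    // Record this request so other processes can see it.
    $ins = $db->prepare('INSERT INTO requests (domain, made_at) VALUES (?, ?)');
    $ins->execute(array($domain, time()));
}

// Call throttle($db, 'ajax.googleapis.com') right before each request.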

Q1. How can I make the connection more robust/reliable, albeit at the cost of speed?

If you're sticking with Google Translate, you might want to switch to the Translate v2 RESTful API. It requires an API key, but signing up will force you to go through their TOS, which should document their maximum number of requests per period. From that, you can throttle your system to whatever rate their service will support and maintain reliability.
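
For reference, a hedged sketch of what a v2 call could look like (the endpoint and parameter names reflect the v2 documentation; YOUR_API_KEY is a placeholder, and $string, $from and $to come from the original script):

$query = http_build_query(array(
    'key'    => 'YOUR_API_KEY', // placeholder: the key you get when signing up
    'q'      => $string,
    'source' => $from,
    'target' => $to,
));

// Same file_get_contents() approach as the original script, pointed at v2
// (the https:// wrapper requires PHP's openssl extension).
$response = file_get_contents('https://www.googleapis.com/language/translate/v2?' . $query);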

Upvotes: 1

declan

Reputation: 5635

You could start with a low wait time and only increase it if you are failing to get a response. Something like this:

$delay = 0;
$i = 0;
$nStrings = count($strings);

while ($i < $nStrings) {
    $response = $this->getTranslation($strings[$i]);

    if ($response) {
        // Save the response somewhere, then move on to the next string.
        $i++;
    } else {
        // No response: back off a little longer before retrying this string.
        $delay += 1000; // usleep() takes microseconds, so each failure adds 1 ms of delay
        usleep($delay);
    }
}

Upvotes: 1
