Reputation: 163
Context:
I am building a robot to read the news block on the first page of Google results. I need the results for 200 search queries (so I need to read 200 pages in total).
To avoid being blocked by Google, I must wait some time before doing the next search from the same IP. If you wait 30 seconds between each search, reading the 200 pages will take (200 * 30/60) = 1h40m.
But as the news in Google results changes very fast, I need those 200 pages to be accessed almost simultaneously. So reading all 200 pages should take only a few minutes.
If the work is divided between 20 proxies (IPs), it will take (200/20 * 30/60) = 5m (20 proxies running simultaneously).
I was planning to use pthreads through cli.
Question / Doubt:
Is it possible to run 20 threads simultaneously? Is it advisable to run only a few threads?
What if I want to run 100 threads (using 100 proxies)?
What other options do I have?
Edit:
I found another option: using PHP curl_multi, or one of the many libraries written on top of curl_multi for this purpose. But I think I'll stick with pthreads.
Upvotes: 2
Views: 2405
Reputation: 431
Why don't you just make a single loop which walks through the proxies?
This way it's just one process at a time, you can filter out dead proxies, and you can still get the desired frequency of updates.
You could do something like this:
$proxies = array('127.0.0.1', '192.168.1.1'); // define proxies
$dead    = array(); // here you can store which proxies went dead (slow, not responding, up to you)
$works   = array('http://google.com/page1', 'http://google.com/page2'); // define what you want to do

$run      = true;
$last     = 0;
$looptime = (5 * 60); // 5 minutes update
$workid   = 0;
$proxyid  = 0;

while ($run)
{
    if ($workid < sizeof($works))
    {   // have something to do ...
        $work = $works[$workid];
        $workid++;
        $success = 0;
        while (($success == 0) and ($proxyid < sizeof($proxies)))
        {
            if (!in_array($proxyid, $dead))
            {
                $proxy = $proxies[$proxyid];
                // launch_the_proxy() is your own fetch routine: it should return non-zero on success
                $success = launch_the_proxy($work, $proxy);
                if ($success == 0) { $dead[] = $proxyid; } // mark the failed proxy as dead
            }
            $proxyid++;
        }
    }
    else
    {   // restart the work sequence once there's no more work to do and loop time is reached
        if (($last + $looptime) < time()) { $last = time(); $workid = 0; $proxyid = 0; }
    }
    sleep(1);
}
Please note, this is a simple example; you will have to work out the details. You must also keep in mind that this approach requires at least as many proxies as there are work items. (You can tweak this later as you wish, but that needs a more complex way to determine which proxy can be used again.)
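One possible sketch of that more complex approach is to track when each proxy was last used and only hand it out again after a cool-down (the pick_proxy() helper, the $lastUsed array and the 30-second cool-down are illustrative assumptions, not part of the original answer):

// Sketch: reuse a proxy only after it has rested for $cooldown seconds.
// $proxies and $dead are the same arrays as in the loop above.
$cooldown = 30;      // minimum seconds between two uses of the same proxy
$lastUsed = array(); // proxy index => unix timestamp of last use

function pick_proxy(array $proxies, array $dead, array $lastUsed, $cooldown)
{
    $now = time();
    foreach ($proxies as $proxyid => $proxy) {
        $rested = !isset($lastUsed[$proxyid]) || ($lastUsed[$proxyid] + $cooldown) <= $now;
        if (!in_array($proxyid, $dead) && $rested) {
            return $proxyid; // first live proxy that has rested long enough
        }
    }
    return false;            // nothing available right now; caller should sleep and retry
}

The main loop would then call pick_proxy() instead of walking $proxyid forward, and record $lastUsed[$proxyid] = time() after each request.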
Upvotes: 0
Reputation: 17168
Some hardware has more than 20 cores; in those cases, it is a no-brainer.
Where your hardware has fewer than 20 cores, it is still not a ridiculous number of threads, given that the nature of the threads means they will spend some time blocked waiting for I/O, and a whole lot more time purposefully sleeping so that you don't anger Google.
Ordinarily, when the threading model in use is 1:1, as it is in PHP, it's a good idea to schedule about as many threads as you have cores; that is a sensible general rule.
Obviously, the software that started before you (your entire operating system) has likely already scheduled many more threads than you have cores.
The best-case scenario still says you can't execute more threads concurrently than you have cores available, which is the reason for the general rule. However, many of the operating system's threads don't actually need to run concurrently, so the authors of those services don't go by the same rules.
Similarly to those threads started by the operating system, you intend to keep your threads from executing concurrently on purpose, so you can bend the rules too.
TL;DR yes, I think that's okay.
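For a concrete picture of what this could look like with pthreads, here is a rough sketch (the SearchThread class, the 30-second delay, the JSON trick for passing the query list, and the $proxies/$queries arrays are all assumptions for illustration, not a tested implementation):

// Sketch only: requires the pthreads extension, run from the CLI.
class SearchThread extends Thread
{
    private $proxy;
    private $queriesJson; // queries passed as a JSON string to avoid pthreads' array coercion

    public function __construct($proxy, array $queries)
    {
        $this->proxy       = $proxy;
        $this->queriesJson = json_encode($queries);
    }

    public function run()
    {
        foreach (json_decode($this->queriesJson, true) as $query) {
            $ch = curl_init('https://www.google.com/search?q=' . urlencode($query));
            curl_setopt($ch, CURLOPT_PROXY, $this->proxy);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $html = curl_exec($ch);
            curl_close($ch);
            // ... parse the news block out of $html here ...
            sleep(30); // keep the per-IP delay even inside the thread
        }
    }
}

// 20 proxies, 10 queries each: start everything, then wait for all threads to finish.
$threads = array();
foreach (array_chunk($queries, 10) as $i => $chunk) {
    $threads[$i] = new SearchThread($proxies[$i], $chunk);
    $threads[$i]->start();
}
foreach ($threads as $t) {
    $t->join();
}

Each thread owns one proxy, so the 30-second wait only blocks that one IP while the other 19 keep working.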
Ordinarily, this might be a bit silly.
But since you plan to force threads to sleep for a long time in between requests, it might be okay here.
You shouldn't normally expect that more threads equates to more throughput. In this case, however, it means you can use more outgoing addresses more easily and sleep for less time overall.
Your operating system has hard limits on the number of threads it will allow you to create; you may well be approaching those limits on normal hardware at 100 threads.
TL;DR in this case, I think that's okay.
If it weren't for the parameters of your operation, namely that you need to sleep between requests and use either specific interfaces or proxies to route requests through multiple addresses, you could use non-blocking I/O quite easily.
Even given the parameters, you could still use non-blocking I/O, but it would make programming the task much more complex than it needs to be.
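For reference, the bare curl_multi pattern the question mentions would look roughly like this (a sketch that fires everything at once, one proxy per handle; the per-IP delays and batching are exactly the part that makes the non-blocking version more complex):

// Sketch: all requests in flight at once via curl_multi, one proxy per handle.
// $urls and $proxies are assumed to be defined elsewhere.
$mh      = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until they have finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);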
In my (possibly biased) opinion, you are better off using threads: the solution will be simpler, have less margin for error, and be easier to understand when you come back to it in 6 months (when it breaks because Google changed their markup or whatever).
Using proxies may prove to be unreliable and slow. If this is to be core functionality for some application, then consider obtaining enough IP addresses that you can route these requests yourself using specific interfaces. cURL, stream context options, and sockets will allow you to set the outbound address; this is likely to be much more reliable and faster.
While speed is not necessarily a concern, reliability should be. It is reasonable for a machine to be bound to 20 addresses; it is less reasonable for it to be bound to 100, but if needs must.
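As a small illustration of binding the outbound address (the addresses below are documentation placeholders), both cURL and stream contexts support this directly:

// With cURL: bind the request to a specific local IP or interface.
$ch = curl_init('https://www.google.com/search?q=example');
curl_setopt($ch, CURLOPT_INTERFACE, '192.0.2.10'); // or an interface name such as 'eth0'
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// With a stream context: the socket 'bindto' option does the same for file_get_contents().
$context = stream_context_create(array(
    'socket' => array('bindto' => '192.0.2.10:0'),
));
$html = file_get_contents('https://www.google.com/search?q=example', false, $context);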
Upvotes: 3