Reputation: 163
Context:
I am building a robot to read the news block on the first page of Google results. I need the results for 200 search queries (so I need to read 200 pages in total).
To avoid being blocked by Google, I must wait some time before doing the next search from the same IP. If you wait 30 seconds between each search, reading the 200 pages will take (200 * 30/60) = 1h40m.
But as the news in Google results changes very fast, I need those 200 pages to be accessed almost simultaneously. So reading all 200 pages should take only a few minutes.
If the work is divided between 20 proxies (IPs), it will take (200/20 * 30/60) = 5m (20 proxies running simultaneously).
I was planning to use pthreads through cli.
Question / Doubt:
Is it possible to run 20 threads simultaneously? Is it advisable to run only a few threads?
What if I want to run 100 threads (using 100 proxies)?
What other options do I have?
Edit:
I found another option: using PHP curl_multi, or one of the many libraries written on top of curl_multi for this purpose. But I think I'll stick with pthreads.
Upvotes: 2
Views: 2405
Reputation: 431
Why don't you just make a single loop which walks through the proxies?
This way it's just one process at a time, you can filter out dead proxies, and you can still get the desired frequency of updates.
You could do something like this:
$proxies = array('127.0.0.1', '192.168.1.1'); // define proxies
$dead    = array(); // here you can store which proxies went dead (slow, not responding, up to you)
$works   = array('http://google.com/page1', 'http://google.com/page2'); // define what you want to do

$run      = true;
$last     = 0;
$looptime = (5 * 60); // 5 minutes update
$workid   = 0;
$proxyid  = 0;

while ($run)
{
    if ($workid < sizeof($works))
    {   // have something to do ...
        $work = $works[$workid];
        $workid++;
        $success = 0;
        while (($success == 0) and ($proxyid < sizeof($proxies)))
        {
            if (!in_array($proxyid, $dead))
            {
                $proxy = $proxies[$proxyid];
                // launch_the_proxy() is your own fetch routine: it should return non-zero on success
                $success = launch_the_proxy($work, $proxy);
                if ($success == 0) { $dead[] = $proxyid; } // mark the failed proxy as dead
            }
            $proxyid++;
        }
    }
    else
    {   // restart the work sequence once there's no more work to do and loop time is reached
        if (($last + $looptime) < time()) { $last = time(); $workid = 0; $proxyid = 0; }
    }
    sleep(1);
}
Please note, this is a simple example; you will have to work out the details. You must also keep in mind that this approach requires at least as many proxies as there are work items. (You can tweak this later as you wish, but that needs a more complex way to determine which proxy can be used again.)
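One possible sketch of that more complex approach is to track when each proxy was last used and only hand it out again after a cool-down (the pick_proxy() helper, the $lastUsed array and the 30-second cool-down are illustrative assumptions, not part of the original answer):

// Sketch: reuse a proxy only after it has rested for $cooldown seconds.
// $proxies and $dead are the same arrays as in the loop above.
$cooldown = 30;      // minimum seconds between two uses of the same proxy
$lastUsed = array(); // proxy index => unix timestamp of last use

function pick_proxy(array $proxies, array $dead, array $lastUsed, $cooldown)
{
    $now = time();
    foreach ($proxies as $proxyid => $proxy) {
        $rested = !isset($lastUsed[$proxyid]) || ($lastUsed[$proxyid] + $cooldown) <= $now;
        if (!in_array($proxyid, $dead) && $rested) {
            return $proxyid; // first live proxy that has rested long enough
        }
    }
    return false;            // nothing available right now; caller should sleep and retry
}

The main loop would then call pick_proxy() instead of walking $proxyid forward, and record $lastUsed[$proxyid] = time() after each request.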
Upvotes: 0
Reputation: 17168
Some hardware has more than 20 cores; in those cases, it is a no-brainer.
Where your hardware has fewer than 20 cores, it is still not a ridiculous number of threads, given that the nature of the threads means they will spend some time blocked waiting for I/O, and a whole lot more time purposefully sleeping so that you don't anger Google.
Ordinarily, when the threading model in use is 1:1, as it is in PHP, it's a good idea to schedule about as many threads as you have cores; that is a sensible general rule.
Obviously, the software that started before you (your entire operating system) has likely already scheduled many more threads than you have cores.
The best-case scenario still says you can't execute more threads concurrently than you have cores available, which is the reason for the general rule. However, many of the operating system's threads don't actually need to run concurrently, so the authors of those services don't go by the same rules.
Similarly to those threads started by the operating system, you intend to keep your threads from executing concurrently on purpose, so you can bend the rules too.
TL;DR yes, I think that's okay.
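For a concrete picture of what this could look like with pthreads, here is a rough sketch (the SearchThread class, the 30-second delay, the JSON trick for passing the query list, and the $proxies/$queries arrays are all assumptions for illustration, not a tested implementation):

// Sketch only: requires the pthreads extension, run from the CLI.
class SearchThread extends Thread
{
    private $proxy;
    private $queriesJson; // queries passed as a JSON string to avoid pthreads' array coercion

    public function __construct($proxy, array $queries)
    {
        $this->proxy       = $proxy;
        $this->queriesJson = json_encode($queries);
    }

    public function run()
    {
        foreach (json_decode($this->queriesJson, true) as $query) {
            $ch = curl_init('https://www.google.com/search?q=' . urlencode($query));
            curl_setopt($ch, CURLOPT_PROXY, $this->proxy);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $html = curl_exec($ch);
            curl_close($ch);
            // ... parse the news block out of $html here ...
            sleep(30); // keep the per-IP delay even inside the thread
        }
    }
}

// 20 proxies, 10 queries each: start everything, then wait for all threads to finish.
$threads = array();
foreach (array_chunk($queries, 10) as $i => $chunk) {
    $threads[$i] = new SearchThread($proxies[$i], $chunk);
    $threads[$i]->start();
}
foreach ($threads as $t) {
    $t->join();
}

Each thread owns one proxy, so the 30-second wait only blocks that one IP while the other 19 keep working.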
Ordinarily, this might be a bit silly.
But since you plan to force threads to sleep for a long time in between requests, it might be okay here.
You shouldn't normally expect that more threads equates to more throughput. In this case, however, it means you can use more outgoing addresses more easily and sleep for less time overall.
Your operating system has hard limits on the number of threads it will allow you to create; you may well be approaching those limits on normal hardware at 100 threads.
TL;DR in this case, I think that's okay.
If it weren't for the parameters of your operation, namely that you need to sleep between requests and use either specific interfaces or proxies to route requests through multiple addresses, you could use non-blocking I/O quite easily.
Even given the parameters, you could still use non-blocking I/O, but it would make programming the task much more complex than it needs to be.
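For reference, the bare curl_multi pattern the question mentions would look roughly like this (a sketch that fires everything at once, one proxy per handle; the per-IP delays and batching are exactly the part that makes the non-blocking version more complex):

// Sketch: all requests in flight at once via curl_multi, one proxy per handle.
// $urls and $proxies are assumed to be defined elsewhere.
$mh      = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until they have finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);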
In my (possibly biased) opinion, you are better off using threads: the solution will be simpler, have less margin for error, and be easier to understand when you come back to it in 6 months (when it breaks because Google changed their markup or whatever).
Using proxies may prove to be unreliable and slow. If this is to be core functionality for some application, then consider obtaining enough IP addresses that you can route these requests yourself using specific interfaces. cURL, stream context options, and sockets will allow you to set the outbound address; this is likely to be much more reliable and faster.
While speed is not necessarily a concern, reliability should be. It is reasonable for a machine to be bound to 20 addresses; it is less reasonable for it to be bound to 100, but if needs must.
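As a small illustration of binding the outbound address (the addresses below are documentation placeholders), both cURL and stream contexts support this directly:

// With cURL: bind the request to a specific local IP or interface.
$ch = curl_init('https://www.google.com/search?q=example');
curl_setopt($ch, CURLOPT_INTERFACE, '192.0.2.10'); // or an interface name such as 'eth0'
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// With a stream context: the socket 'bindto' option does the same for file_get_contents().
$context = stream_context_create(array(
    'socket' => array('bindto' => '192.0.2.10:0'),
));
$html = file_get_contents('https://www.google.com/search?q=example', false, $context);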
Upvotes: 3