Reputation: 11
I'm trying to understand why a map on my load-balanced view takes 22 seconds to execute on 2 cores, rather than the roughly 10 ms it takes on one core with the built-in map. I understand that parallel work has overhead associated with it, but that can't explain the extra 22 seconds. What am I doing wrong?
I am running Python 2.7 on an Intel Core 2 Duo Mac (OS X 10).
In [4]: from IPython.parallel import Client
In [5]: rc = Client()
In [6]: lview = rc.load_balanced_view()
In [7]: lview.block = True
In [8]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 9.91 ms per loop
In [9]: %timeit lview.map(lambda x:x**10, range(3000))
1 loops, best of 3: 22.8 s per loop
Upvotes: 1
Views: 414
Reputation: 9890
As univerio noted, there's a considerable amount of overhead, so testing IPython.parallel with very fast tasks will give poor performance. Your tasks take almost no time to complete, so the overhead dwarfs the work itself. If each task took one second to complete, on the other hand, IPython.parallel would be far more useful. Keep in mind that the system is designed not just for distributing tasks across multiple cores, but also across multiple computers that may have very different environments, are not running pre-shared code, and don't necessarily have shared memory or disks. I have in the past had a controller distributing tasks to 300 CPUs on a number of computers in different cities, running different Python versions and different operating systems. All of this requires quite a bit of overhead: when you send a task, for example, you're sending the code and data it needs along with it.
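You can see that overhead directly by timing a single trivial task. This is a minimal sketch, assuming the same Client/load_balanced_view setup as in the question; almost all of the measured time is serialization, queueing, and transport rather than computation:

import time
from IPython.parallel import Client

rc = Client()
lview = rc.load_balanced_view()
lview.block = True

# Round-trip cost of one essentially free task: the work itself is nothing,
# so what you measure is the per-task overhead.
start = time.time()
lview.map(lambda x: x, [0])
print("one trivial task: %.3f s" % (time.time() - start))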
Another issue, however, is that IPython's parallel system needs to be configured for the sorts of tasks you are giving it. In particular, the High Water Mark (HWM) setting in the ipcontroller configuration has a significant impact on performance for small tasks. By default, HWM is set to 1, which means that the controller sends one task to each ipengine worker and doesn't send that worker a new task until the first one has been returned. This gives the best load balancing: if tasks take different amounts of time, each worker gets a new task as soon as it finishes the one it is working on, so faster workers end up handling more tasks. For a large number of very small tasks, though, this can be very slow.
When your tasks are that fast, the per-task round trip adds far more overhead than the work itself, and setting HWM to something higher can help. HWM is essentially how many tasks are allowed to be outstanding on an engine at once: set it to 10, and the controller sends 10 tasks to each engine up front, then sends new ones (singly) as an engine drops below 10 outstanding tasks.
A particularly useful setting for a large number of very fast tasks is the special setting of 0. In this case, the controller distributes all the tasks to the workers at one time, and then waits for them to return.
This setting is c.TaskScheduler.hwm in ipcontroller_config.py.
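For example, a minimal sketch of what that file might contain (the profile directory it lives in depends on your setup; restart the controller after changing it):

# ipcontroller_config.py
c = get_config()

# 0: hand all tasks out at once (useful for many very fast tasks)
# 1: default; strict load balancing, one outstanding task per engine
# N: allow up to N outstanding tasks per engine
c.TaskScheduler.hwm = 0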
Upvotes: 1
Reputation: 20538
There's simply a lot of overhead: every element you map over becomes a job that has to be sent to a worker through the message queue. If you distributed your jobs more cleverly, it would be much more efficient (though still not quite as efficient as the single-threaded version):
In [7]: %timeit map(lambda x:x**10, range(3000))
100 loops, best of 3: 3.17 ms per loop
In [8]: %timeit lview.map(lambda i:[x**10 for x in range(i * 500)], range(6)) # I'm using 6 cores
100 loops, best of 3: 11.4 ms per loop
In [9]: %timeit lview.map(lambda i:[x**10 for x in range(i * 1500)], range(2))
100 loops, best of 3: 5.76 ms per loop
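You can get a similar effect without hand-rolling the chunking, assuming your IPython.parallel version supports the chunksize argument on the load-balanced view's map; this is a sketch, not a measured run:

# Groups the 3000 elements into tasks of 500 each, so only a handful of
# messages go through the queue instead of one per element.
result = lview.map(lambda x: x**10, range(3000), chunksize=500)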
If your workload gets big enough, parallelization pays off:
In [10]: %timeit lview.map(lambda i:len([x**10 for x in range(i * 500000)]), range(6))
1 loops, best of 3: 2.86 s per loop
In [11]: %timeit map(lambda x:x**10, range(3000000))
1 loops, best of 3: 3.41 s per loop
Upvotes: 2