Reputation: 5049
I'm trying to wrap my head around the basic concept of applying parallel computing in my Python code. I have read many tutorials on IPython parallel; however, I don't seem to fully understand how to apply it elegantly in some basic Python code. For example, the demo code below in my_script.py:
# imports
import numpy as np
from IPython.parallel import Client
# class definition
class MyClass():
    def do_something(self, x, y):
        return np.sum(x, y)
# some variables
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# create client and direct view to all engines available
client = Client()
dview = client[:]
dview.block = True
# here is what i'm currently doing to achieve parallelism
dview.execute('import numpy as np')
dview['MyClass'] = MyClass
dview.scatter('x', x)
dview.scatter('y', y)
dview.execute('my = MyClass()')
dview.execute('z = my.do_something(x, y)')
z = dview.gather('z')
My questions:
Is there a way to include numpy as np once in all namespaces instead of twice, as here (once in the import at the top and a second time in the execute() call)?
The same as the first question but for MyClass. Is there a more elegant way to include MyClass in all namespaces instead of explicitly pushing the class type as a variable?
How would you approach writing the code above in the most elegant pythonic/ipythonic way?
Upvotes: 0
Views: 215
Reputation: 5531
I find it helpful to think of IPython as taking care of moving the data around between processes, but in general not the code. The notable exceptions are functions (with map() and apply()).
Question 1: I don't think it's possible to import numpy only once, since your main process and your client processes are not identical. For more readable code you could do something like this:
from IPython.parallel import Client
client = Client()  # create client and direct view to all engines available
dview = client[:]
dview.block = True
with dview.sync_imports():  # import locally and on all client processes
    import numpy  # `import numpy as np` does not work
def f(x, y):
    """ return x**2 + y """
    return numpy.power(x, 2) + y
x = numpy.arange(5)  # use the full module name, since no `np` alias exists
y = 1000 * numpy.arange(5)
zz = dview.map(f, x, y)  # execute f() in parallel on different clients
print(zz)  # gives: [0, 1001, 2004, 3009, 4016]
Check the IPython manual for why import numpy as np will not work.
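As an alternative to sync_imports(), a function can carry its own import, so that each process importing it needs nothing in its namespace beforehand. The following is only a sketch of that idea, not something taken from the IPython manual: the name f_self_contained is made up for illustration, and the built-in map() stands in for dview.map() so the function can be checked without a running cluster.

```python
def f_self_contained(x, y):
    """Return x**2 + y; the import runs wherever the function runs."""
    import numpy  # executed in the process that calls the function
    return numpy.power(x, 2) + y

x = range(5)
y = [1000 * i for i in range(5)]

# Local check with the built-in map(); convert NumPy scalars to plain ints.
zz = [int(v) for v in map(f_self_contained, x, y)]
print(zz)

# With a running cluster, the call would take the same arguments:
# zz = dview.map(f_self_contained, x, y)
```

The cost is a repeated import statement on each call, which is cheap after the first time because Python caches loaded modules.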
Question 2: I think the cleanest approach is to put MyClass in a separate file and import it like numpy above. The reason is that the parallel processing (MyClass) and the management (distributing and collecting data) should be kept separate. Arguably this is also a matter of taste.
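A rough sketch of that separation, under some assumptions: the module name my_module is hypothetical, the file must be importable by the engines (for example, because they were started in the same directory), and do_something() is rewritten here to add element by element. The cluster calls are left as comments because they need running engines; only the class itself is exercised locally.

```python
# --- contents of a hypothetical my_module.py ---
class MyClass:
    """Per-process worker; the call counter stands in for local state."""

    def __init__(self):
        self.calls = 0

    def do_something(self, x, y):
        self.calls += 1
        return [a + b for a, b in zip(x, y)]

# --- main script: management only (needs a running cluster) ---
# with dview.sync_imports():
#     import my_module
# dview.scatter('x', x)
# dview.scatter('y', y)
# dview.execute('my = my_module.MyClass()')
# dview.execute('z = my.do_something(x, y)')
# z = dview.gather('z')

# Local check of the class on its own, without any engines.
my = MyClass()
z = my.do_something([1, 2, 3], [1, 2, 3])
print(z)
```

Each engine ends up with its own MyClass instance, so any state stays local to that process.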
Question 3: Your method do_something() does not work, since np.sum() sums up the elements of a single array (its second argument is the axis, not a second array). So I assume that you want to calculate x[0]+y[0], x[1]+y[1], ... in parallel. Classes and parallel processes do not play well together, because a main idea of a class instance is to have state (in the form of member variables), and stateful functions are very difficult to parallelize. So as a general approach to parallelization, try to use dview.map(), since it takes care of splitting the arrays into chunks and distributing them to the clients. It also forces the passed functions not to have local state. If you need local state in each process, use the approach from question 2.
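To make the element-wise intent concrete, here is a minimal sketch of a stateless function that computes x[i] + y[i]. The name add_pair is made up for illustration. The built-in map() is used as a local stand-in, and the dview.map() call is shown only as a comment because it needs a running cluster.

```python
import numpy as np

def add_pair(a, b):
    """Stateless: the result depends only on the two arguments."""
    return a + b

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Serial reference result using the built-in map().
z_local = list(map(add_pair, x, y))

# NumPy's element-wise addition is np.add(), not np.sum().
z_numpy = np.add(x, y).tolist()

print(z_local)

# With a running cluster, the parallel call would take the same arguments:
# z_parallel = dview.map(add_pair, x, y)
```

Checking the function serially first makes it easier to tell a logic error from a cluster problem.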
Upvotes: 2