Reputation: 31
Despite all the seemingly similar questions and answers, here goes:
I have a fairly large 2D numpy array and would like to process it row by row using multiprocessing. For each row I need to find specific (numeric) values and use them to set values in a second 2D numpy array. A small example (the real use case is an array of approx. 10000x10000 cells):
import numpy as np
inarray = np.array([(1.5,2,3), (4,5.1,6), (2.7, 4.8, 4.3)])
outarray = np.array([(0.0,0.0,0.0), (0.0,0.0,0.0), (0.0,0.0,0.0)])
I would now like to process inarray row by row using multiprocessing: find all the cells in inarray that are greater than 5 (here inarray[1,1] and inarray[1,2]), and set the cells in outarray whose indices are one smaller in both dimensions (here outarray[0,0] and outarray[0,1]) to 1.
After looking at several related questions and answers, I'm sad to say I still don't know how to do it. Help!
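For clarity, this is the plain single-process loop version of what I want (my own naive sketch, just to pin down the expected output on the small example):

```python
import numpy as np

inarray = np.array([(1.5, 2, 3), (4, 5.1, 6), (2.7, 4.8, 4.3)])
outarray = np.zeros_like(inarray)

# naive single-process version: a value > 5 at (i, j) marks (i-1, j-1)
for i in range(1, inarray.shape[0]):
    for j in range(1, inarray.shape[1]):
        if inarray[i, j] > 5:
            outarray[i - 1, j - 1] = 1

print(outarray)
```

This is of course far too slow for 10000x10000 cells, which is why I want to parallelize it.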
Upvotes: 3
Views: 1360
Reputation: 67427
If you can use the latest numpy development version, then you can use multithreading instead of multiprocessing. Since this PR was merged a couple of months ago, numpy releases the GIL when indexing, so you can do something like:
import numpy as np
import threading

def target(in_, out):
    out[in_ > .5] = 1

def multi_threaded(a, thread_count=3):
    b = np.zeros_like(a)
    chunk = len(a) // thread_count
    threads = []
    for j in range(thread_count):
        # rows 1..n-1 of `a`, split into per-thread chunks
        sl_a = slice(1 + chunk*j,
                     a.shape[0] if j == thread_count-1 else 1 + chunk*(j+1),
                     None)
        # the matching output rows, shifted up by one
        sl_b = slice(sl_a.start - 1, sl_a.stop - 1, None)
        threads.append(threading.Thread(target=target,
                                        args=(a[sl_a, 1:], b[sl_b, :-1])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b
And now do things like:
In [32]: a = np.random.rand(100, 100000)
In [33]: %timeit multi_threaded(a, 1)
1 loops, best of 3: 121 ms per loop
In [34]: %timeit multi_threaded(a, 2)
10 loops, best of 3: 86.6 ms per loop
In [35]: %timeit multi_threaded(a, 3)
10 loops, best of 3: 79.4 ms per loop
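As a sanity check (my addition; the block re-declares target so it runs standalone), the threaded result should equal a straightforward vectorized assignment over the same shifted region:

```python
import numpy as np
import threading

def target(in_, out):
    out[in_ > .5] = 1

a = np.random.rand(50, 200)

# vectorized reference: same shifted-index rule, no threads
expected = np.zeros_like(a)
expected[:-1, :-1][a[1:, 1:] > .5] = 1

# a single thread handling the whole shifted region at once
b = np.zeros_like(a)
t = threading.Thread(target=target, args=(a[1:, 1:], b[:-1, :-1]))
t.start()
t.join()

assert np.array_equal(b, expected)
```

The last row and last column of b are never written, which matches the index-shift the question asks for.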
Upvotes: 2
Reputation: 1882
I don't think multiprocessing is the right call here, because it would mean mutating one object from multiple processes, which is not a good idea. Finding the indices via multiple processes sounds nice, but in order to send data to another process, the object is pickled internally (again: as far as I know), so the overhead would likely outweigh any benefit.
Please try this and tell us if it is very slow:
outarray[:-1, :-1][inarray[1:, 1:] > 5] = 1
outarray
array([[ 1.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
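To get a feel for whether this single vectorized assignment is fast enough near the stated problem size, you can time it yourself (a rough sketch; the 2000x2000 size is a stand-in I picked to keep the run short, scale it up as memory allows):

```python
import numpy as np
import time

rng = np.random.default_rng(0)
inarray = rng.random((2000, 2000)) * 10   # stand-in for the 10000x10000 case
outarray = np.zeros_like(inarray)

t0 = time.perf_counter()
# mark (i-1, j-1) in outarray wherever inarray[i, j] > 5 (for i, j >= 1)
outarray[:-1, :-1][inarray[1:, 1:] > 5] = 1
elapsed = time.perf_counter() - t0
print(f"vectorized assignment: {elapsed:.4f} s")
```

A single pass like this is usually memory-bandwidth bound, which is another reason extra processes are unlikely to help much.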
Upvotes: 0