user3634426

Reputation: 31

python multiprocessing array

Despite all the seemingly similar questions and answers, here goes:

I have a fairly large 2D numpy array and would like to process it row by row using multiprocessing. For each row I need to find specific (numeric) values and use them to set values in a second 2D numpy array. A small example (the real use case is an array of approx. 10000x10000 cells):

import numpy as np
inarray = np.array([(1.5,2,3), (4,5.1,6), (2.7, 4.8, 4.3)])
outarray = np.array([(0.0,0.0,0.0), (0.0,0.0,0.0), (0.0,0.0,0.0)])

I would now like to process inarray row by row using multiprocessing: find all the cells in inarray that are greater than 5 (e.g. inarray[1,1] and inarray[1,2]), and set the cells in outarray whose index locations are one smaller in both dimensions (e.g. outarray[0,0] and outarray[0,1]) to 1.
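To make the intended mapping concrete, this is the plain single-process loop I am effectively trying to parallelise (a reference sketch, not my real code):

```python
import numpy as np

inarray = np.array([(1.5, 2, 3), (4, 5.1, 6), (2.7, 4.8, 4.3)])
outarray = np.zeros_like(inarray)

# Skip row 0 and column 0 of inarray: those cells have no
# "one smaller in both dimensions" target in outarray.
for i in range(1, inarray.shape[0]):
    for j in range(1, inarray.shape[1]):
        if inarray[i, j] > 5:
            outarray[i - 1, j - 1] = 1
```

For the sample data, inarray[1,1] (5.1) and inarray[1,2] (6) exceed 5, so outarray[0,0] and outarray[0,1] end up as 1 and everything else stays 0.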

After looking here and here and here I'm sad to say I still don't know how to do it. Help!

Upvotes: 3

Views: 1360

Answers (2)

Jaime

Reputation: 67427

If you can use the latest numpy development version, then you can use multithreading instead of multiprocessing. Since this PR was merged a couple of months ago, numpy releases the GIL when indexing, so you can do something like:

import numpy as np
import threading

def target(in_, out):
    out[in_ > .5] = 1

def multi_threaded(a, thread_count=3):
    b = np.zeros_like(a)
    chunk = len(a) // thread_count
    threads = []
    for j in range(thread_count):
        # Rows 1..end of `a` are compared; each hit sets the cell one row
        # up and one column left in `b`, hence the shifted slices below.
        sl_a = slice(1 + chunk*j,
                     a.shape[0] if j == thread_count-1 else 1 + chunk*(j+1),
                     None)
        sl_b = slice(sl_a.start-1, sl_a.stop-1, None)
        threads.append(threading.Thread(target=target, args=(a[sl_a, 1:],
                                                             b[sl_b, :-1])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b

And now do things like:

In [32]: a = np.random.rand(100, 100000)

In [33]: %timeit multi_threaded(a, 1)
1 loops, best of 3: 121 ms per loop

In [34]: %timeit multi_threaded(a, 2)
10 loops, best of 3: 86.6 ms per loop

In [35]: %timeit multi_threaded(a, 3)
10 loops, best of 3: 79.4 ms per loop

Upvotes: 2

koffein

Reputation: 1882

I don't think multiprocessing is the right call here, because multiple processes would have to modify the same object, which is not a good idea. I get that it would be nice to find the indexes via multiple processes, but as far as I know, any object sent to another process is internally pickled, so each worker would receive a copy of the data rather than the array itself.
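That said, if the data is big enough that pickling really hurts, it can be placed in shared memory so that worker processes operate on views instead of copies. A rough sketch (not tuned, assuming double-precision data; `worker` and `run` are names I made up) using multiprocessing.Array:

```python
import numpy as np
from multiprocessing import Process, Array

def worker(shared_in, shared_out, shape, row_start, row_stop):
    # Rebuild numpy views over the shared buffers -- no copy, and the
    # array data itself is never pickled.
    a = np.frombuffer(shared_in.get_obj()).reshape(shape)
    b = np.frombuffer(shared_out.get_obj()).reshape(shape)
    rows = slice(max(row_start, 1), row_stop)  # row 0 has no target cell
    mask = a[rows, 1:] > 5
    b[rows.start - 1:rows.stop - 1, :-1][mask] = 1

def run(inarray, n_procs=2):
    shape = inarray.shape
    shared_in = Array('d', inarray.ravel())  # data is copied in once
    shared_out = Array('d', inarray.size)    # zero-initialised
    chunk = -(-shape[0] // n_procs)          # ceil division
    procs = [Process(target=worker,
                     args=(shared_in, shared_out, shape,
                           k * chunk, min((k + 1) * chunk, shape[0])))
             for k in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return np.frombuffer(shared_out.get_obj()).reshape(shape).copy()
```

Note that on platforms using the "spawn" start method (e.g. Windows) this needs to run under an `if __name__ == '__main__':` guard.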

Please try this and tell us if it is very slow:

# Compare the shifted block of inarray against 5 and write into the
# correspondingly shifted block of outarray (the boolean mask must have
# the same shape as the array it indexes):
outarray[:-1, :-1][inarray[1:, 1:] > 5] = 1
outarray

array([[ 1.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

Upvotes: 0
