zephyrus

Reputation: 1266

Multiprocessing large convolutions using SciPy: no speed-up

I am using scipy.signal.correlate to perform large 2D convolutions. I have a large number of arrays to operate on, so naturally I thought multiprocessing.Pool could help. However, the following simple setup (on a 4-core CPU) provides no benefit.

import multiprocessing as mp
import numpy as np
from scipy import signal

arrays = [np.ones([500, 500])] * 100
kernel = np.ones([30, 30])

def conv(array, kernel):
  return (array, kernel, signal.correlate(array, kernel, mode="valid", method="fft"))

pool = mp.Pool(processes=4)
results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]

Changing the process count among {1, 2, 3, 4} gives approximately the same runtime (2.6 s ± 0.2 s).

What could be going on?

Upvotes: 2

Views: 1097

Answers (2)

NoDataDumpNoContribution

Reputation: 10859

What could be going on?

At first I thought the reason was that the underlying FFT implementation is already parallelized: while the Python interpreter itself is single-threaded, the C code called from Python may be able to fully utilize your CPU.

However, the underlying FFT implementation used by scipy.correlate appears to be NumPy's fftpack (translated from Fortran written in 1985), which, judging from the fftpack page, is single-threaded.

Indeed, when I ran your script I saw considerably higher CPU usage. With four cores on my computer and four processes I got a speedup of roughly 2 (for 2000x2000 arrays, using the code changes from Raw Dawg).

The overhead of creating the processes and communicating with them eats up part of the benefit of the more efficient CPU usage.
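If you want to see how much of the time goes into process startup and pickling rather than into the FFTs themselves, a rough way is to time the serial loop against the pool version. Below is a minimal sketch, assuming the 2000x2000 test size mentioned above; the array count is illustrative only and the numbers will depend on your machine.

import time
import multiprocessing as mp
from functools import partial
import numpy as np
from scipy import signal

kernel = np.ones([30, 30])

def conv(array, kernel):
    # Same correlation as in the question.
    return signal.correlate(array, kernel, mode="valid", method="fft")

if __name__ == "__main__":
    # Larger arrays make the per-task FFT work dominate the pickling overhead.
    arrays = [np.ones([2000, 2000]) for _ in range(16)]

    t0 = time.perf_counter()
    serial = [conv(a, kernel) for a in arrays]
    t1 = time.perf_counter()

    with mp.Pool(processes=4) as pool:
        parallel = pool.map(partial(conv, kernel=kernel), arrays)
    t2 = time.perf_counter()

    print(f"serial:   {t1 - t0:.2f} s")
    print(f"pool.map: {t2 - t1:.2f} s")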

There are still a few things you can try: optimize the array sizes, or compute only a small part of the correlation if you don't need all of it. If the kernel is the same every time, you can compute the FFT of the kernel only once and reuse it (this requires implementing part of scipy.correlate yourself; a sketch follows below). You could also try single precision instead of double, or do the correlation on the graphics card with CUDA.
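Here is one way the "reuse the kernel FFT" idea could look. This is not something scipy.signal exposes directly; it is a hand-rolled sketch that reproduces mode="valid" correlation, assuming all arrays share the same shape and everything is real-valued.

import numpy as np
from scipy import fft

def make_correlator(kernel, array_shape):
    # Pad to the full linear-correlation size and compute the kernel FFT once.
    # Flipping the kernel turns the convolution below into a correlation.
    full_shape = [a + k - 1 for a, k in zip(array_shape, kernel.shape)]
    kernel_fft = fft.rfftn(kernel[::-1, ::-1], full_shape)

    def correlate_valid(array):
        out = fft.irfftn(fft.rfftn(array, full_shape) * kernel_fft, full_shape)
        # Crop to the "valid" region, matching signal.correlate(..., mode="valid").
        slices = tuple(slice(k - 1, a) for a, k in zip(array_shape, kernel.shape))
        return out[slices]

    return correlate_valid

# Build the correlator once, then apply it to every array (serially or in a pool).
corr = make_correlator(np.ones([30, 30]), (500, 500))
result = corr(np.ones([500, 500]))  # matches the original call up to FFT rounding

This saves one of the three FFTs per array; whether that is worth the extra code is something you would have to measure.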

Upvotes: 2

Sevy

Reputation: 698

I think the problem is in this line:

results = [pool.apply(conv, args=(arr, kernel)) for arr in arrays]

Pool.apply is a blocking operation: it runs conv on one element, waits for it to finish, and only then moves on to the next, so even though it looks like you are multiprocessing, nothing is actually being distributed. You need to use pool.map instead to get the behavior you are looking for:

from functools import partial

conv_partial = partial(conv, kernel=kernel)
results = pool.map(conv_partial, arrays)
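A small caveat worth adding: on platforms that start worker processes by spawning (Windows, and macOS on recent Python versions), the pool creation and the map call have to live under an if __name__ == "__main__": guard, and using the pool as a context manager closes it cleanly:

if __name__ == "__main__":
    conv_partial = partial(conv, kernel=kernel)
    with mp.Pool(processes=4) as pool:
        results = pool.map(conv_partial, arrays)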

Upvotes: 2
