Reputation: 8982
Is it possible to obtain better performance (both in memory consumption and speed) in this moving-window computation? I have a 1000x1000
numpy array and I take 16x16
windows through the whole array and finally apply some function to each window (in this case, a discrete cosine transform.)
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size,window_size))
dcts = np.zeros(windows.reshape(-1,window_size, window_size).shape, dtype=np.float32)
for idx, window in enumerate(windows.reshape(-1,window_size, window_size)):
dcts[idx, :, :] = dct(window)
dcts = dcts.reshape(windows.shape)
This code takes too much memory (in the example above, the memory consumption is not so bad - windows
uses 1Gb and dcts
also needs 1Gb) and is taking 25 seconds to complete. I'm a bit unsure as to what I'm doing wrong because this should be a straightforward calculation (e.g. filtering an image.) Is there a better way to accomplish this?
UPDATE:
I was initially worried that the arrays produced by Kington's solution and my initial approach were very different, but the difference is restricted to the boundaries, so it is unlikely to cause serious issues for most applications. The only remaining problem is that both solutions are very slow. Currently, the first solution takes 1min 10s and the second solution 59 seconds.
UPDATE 2:
I noticed the biggest culprits by far are dct
and np.mean
. Even generic_filter
performs decently (8.6 seconds) using a "cythonized" version of mean
with bottleneck:
import bottleneck as bp
def func(window, shape):
window = window.reshape(shape)
#return np.abs(dct(dct(window, axis=1), axis=0)).mean()
return bp.nanmean(dct(window))
result = scipy.ndimage.generic_filter(X, func, (16, 16),
extra_arguments=([16, 16],))
I'm currently reading how to wrap C code using numpy in order to replace scipy.fftpack.dct
. If anyone knows how to do it, I would appreciate the help.
Upvotes: 4
Views: 964
Reputation:
Since scipy.fftpack.dct
calculates separate transforms along the last axis of the input array, you can replace your loop with:
windows = view_as_windows(X, (window_size,window_size))
dcts = dct(windows)
result1 = dcts.mean(axis=(2,3))
Now only the dcts
array requires a lot of memory and windows
remains merely a view into X
. And because the DCT's are calculated with a single function call it's also much faster. However, because the windows overlap there are lots of repeated calculations. This can be overcome by only calculating the DCT for each sub-row once, followed by a windowed mean:
ws = window_size
row_dcts = dct(view_as_windows(X, (1, ws)))
cs = row_dcts.squeeze().sum(axis=-1).cumsum(axis=0)
result2 = np.vstack((cs[ws-1], cs[ws:]-cs[:-ws])) / ws**2
Though it seems what is gained in effeciency is lost in code clarity... But basically the approach here is to first calculate the DCT's and then take the window average by summing over the 2D window and then dividing by the number of elements in the window. The DCTs are already calculated over rowwise moving windows, so we take a regular sum over those windows. However we need to take a moving window sum over the columns, to arrive at the proper 2D window sums. To do this efficiently we use a cumsum
trick, where:
sum(A[p:q]) # q-p == window_size
Is equivalent to:
cs = cumsum(A)
cs[q-1] - cs[p-1]
This avoids having to sum the exact same numbers over and over. Unfortunately it doesn't work for the first window (when p == 0
), so for that we have to take only cs[q-1]
and stack it together with the other window sums. Finally we divide by the number of elements to arrive at the 2D window average.
If you like to do a 2D DCT than this second approach becomes less interesting, beause you'll eventually need the full 985 x 985 x 16 x 16
array before you can take the mean.
Both approaches above should be equivalent, but it may be a good idea to perform the arithmetic with 64-bit floats:
np.allclose(result1, result2, atol=1e-6)
# False
np.allclose(result1, result2, atol=1e-5)
# True
Upvotes: 3
Reputation: 284712
skimage.util.view_as_windows
is using striding tricks to make an array of overlapping "windows" that doesn't use any additional memory.
However, when you make a new array of the shape shape, it will require ~32 times (16 x 16) the memory that your original X
array or the windows
array used.
Based on your comment, your end result is doing dcts.reshape(windows.shape).mean(axis=2).mean(axis=2)
- taking the mean of the dct
of each window.
Therefore, it would be more memory-efficient (though similar performance wise) to take the mean inside the loop and not store the huge intermediate array of windows:
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size, window_size))
dcts = np.zeros(windows.shape[:2], dtype=np.float32).ravel()
for idx, window in enumerate(windows.reshape(-1, window_size, window_size)):
dcts[idx] = dct(window).mean()
dcts = dcts.reshape(windows.shape[:2])
Another option is scipy.ndimage.generic_filter
. It won't increase performance much (the bottleneck is the python function call in the inner loop), but you'll have a lot more boundary condition options, and it will be fairly memory efficient:
import numpy as np
from scipy.fftpack import dct
import scipy.ndimage
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
def func(window, shape):
window = window.reshape(shape)
return dct(window).mean()
result = scipy.ndimage.generic_filter(X, func, (16, 16),
extra_arguments=([16, 16],))
Upvotes: 3