Reputation: 35
So I am calculating Poisson distributions using large amounts of data. I have an array of shape (2666667,19) - "spikes", and an array of shape (19,100) - "placefields". I used to have a for loop that iterated through the 2666667 dimension, which took around 60 seconds to complete. Then I learned that if I vectorize for loops, the code becomes much faster, so I tried to do so. The vectorized form works and outputs the same results; however, it now takes 120 seconds :/
Here is the original loop (60s):
import numpy as np
from scipy import stats

def compute_probability(spikes, placefields):
    nTimeBins = len(spikes[0])
    probability = np.empty((nTimeBins, 99))  # empty probability matrix
    for i in range(nTimeBins):
        nspikes = np.tile(spikes[:, i], (99, 1))  # tile to (99, 19) so it can be swapped to (19, 99)
        nspikes = np.swapaxes(nspikes, 0, 1)
        maxL = stats.poisson.pmf(nspikes, placefields)
        maxL = maxL.prod(axis=0)  # product of the per-cell probabilities
        probability[i, :] = maxL
    return probability
And here is the vectorised form (120s):
def compute_probability(spikes, placefields):
    placefields = np.reshape(placefields, (19, 99, 1))
    # prepared placefields: shape (19, 99, 1)
    nspikes = np.tile(spikes, (99, 1, 1))
    nspikes = np.swapaxes(nspikes, 0, 1)
    # prepared nspikes: shape (19, 99, nTimeBins)
    probability = stats.poisson.pmf(nspikes, placefields)
    probability = np.swapaxes(probability.prod(axis=0), 0, 1)
    return probability
Why is it SO SLOW? I think it might be that the tiled arrays created by the vectorized form are so gigantic that they take up a huge amount of memory. How can I make it go faster? Download samplespikes and sampleplacefields (as suggested by the comments): https://mega.nz/file/lpRF1IKI#YHq1HtkZ9EzYvaUdlrMtBwMg-0KEwmhFMYswxpaozXc
EDIT: The issue was that although it was vectorized, the huge array was taking up too much RAM. I have split the calculation into chunks, and it does better now:
placefields = np.reshape(placefields, (len(placefields), 99, 1))
nspikes = np.swapaxes(np.tile(spikes, (xybins, 1, 1)), 0, 1)  # xybins = number of place bins (99 here)
probability = np.empty((len(spikes[0]), xybins))
chunks = len(spikes[0]) // 20      # number of chunks
n = int(len(spikes[0]) / chunks)   # time bins per chunk
for i in range(0, len(nspikes[0][0]), n):
    nspikes_chunk = nspikes[:, :, i:i+n]
    probability_chunk = stats.poisson.pmf(nspikes_chunk, placefields)
    probability_chunk = np.swapaxes(probability_chunk.prod(axis=0), 0, 1)
    # write the chunk into the output (the last chunk may be shorter)
    if len(probability_chunk) < (len(spikes)//chunks):
        probability[i:] = probability_chunk
    else:
        probability[i:i+len(probability_chunk)] = probability_chunk
Upvotes: 2
Views: 126
Reputation: 50678
This is likely due to memory/cache effects.
The first code works on small arrays that fit in the CPU caches. It is not great, because each NumPy function call takes some time. The second code fixes that issue. However, it allocates and fills huge temporary arrays in memory, tens of GiB with the full data. It is much faster to work in the CPU caches than in main memory (RAM). This is especially true when the working arrays are used only once (because of expensive OS page faults), which seems to be the case in your code. If you do not have enough memory, the OS will read/write temporary data to SSD/HDD storage devices, which are very slow compared to the RAM and the CPU caches.
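As a back-of-the-envelope check, using the shapes from the question's code (19 cells, 99 place bins, 2666667 time bins, 8 bytes per float64):

# size of the tiled nspikes array alone; the intermediate pmf result is just as large
19 * 99 * 2666667 * 8 / 2**30   # ~37 GiB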
The best solution is probably to work on chunks so that the operation is both vectorized (reducing the overhead of the NumPy function calls) and fits in the CPU caches (reducing the cost of RAM reads/writes). Note that the size of the last-level cache is typically a few MiB nowadays on mainstream PC processors.
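For illustration, here is a minimal sketch of that chunked approach, assuming spikes has shape (nCells, nTimeBins) and placefields has shape (nCells, nBins) as in the question's code; the name compute_probability_chunked and the chunk_size of 200 time bins are arbitrary choices that you should tune against your cache size:

import numpy as np
from scipy import stats

def compute_probability_chunked(spikes, placefields, chunk_size=200):
    # spikes: (nCells, nTimeBins), placefields: (nCells, nBins)
    # chunk_size=200 keeps the per-chunk temporaries around a few MiB (19*99*200*8 B ~ 3 MiB)
    nCells, nTimeBins = spikes.shape
    nBins = placefields.shape[1]
    placefields = placefields.reshape(nCells, nBins, 1)
    probability = np.empty((nTimeBins, nBins))
    for start in range(0, nTimeBins, chunk_size):
        stop = min(start + chunk_size, nTimeBins)
        # broadcast (nCells, 1, chunk) against (nCells, nBins, 1) instead of tiling
        chunk = spikes[:, np.newaxis, start:stop]
        pmf = stats.poisson.pmf(chunk, placefields)      # (nCells, nBins, chunk)
        probability[start:stop, :] = pmf.prod(axis=0).T  # product over cells -> (chunk, nBins)
    return probability

Note that broadcasting with np.newaxis avoids materializing the giant tiled nspikes array at all, so only the small per-chunk temporaries are ever allocated.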
The takeaway message is that vectorization does not always make things faster. For better performance, one should care about the size of the manipulated data chunks so that they fit in the CPU caches.
PS: note that if you do not care too much about precision, you can use single-precision (np.float32) instead of double-precision (np.float64) to speed up the computation a bit.
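For example, assuming spikes and placefields are the arrays from the question and compute_probability_chunked is the sketch above, the cast can be done once up front (stats.poisson.pmf may still compute in double precision internally, so the main saving here is in the memory traffic of the large input arrays):

spikes = spikes.astype(np.float32)            # halves the memory footprint of the inputs
placefields = placefields.astype(np.float32)
probability = compute_probability_chunked(spikes, placefields)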
Upvotes: 1