Ipulatov

Reputation: 195

Cupy slower than numpy when doing a "for loop" for columns of an array as vectors

I'm trying to parallelize the following operation with cupy: I have an array. For each column of that array, I'm generating 2 random vectors. I take that array column, add one of the vectors, subtract the other, and make that new vector the next column of the array. I continue on until I finish with the array.

I already asked a related question, "Cupy slower than numpy when iterating through array". This one is different: I believe I followed the advice there to parallelize the operation, so I now have one "for loop" instead of two and iterate only through the array columns instead of both rows and columns.

import cupy as cp
import time
#import numpy as cp


def row_size(array):
    return(array.shape[1])

def number_of_rows(array):
    return(array.shape[0])

x = (cp.zeros((200,200), 'f'))
#x = cp.zeros((200,200))

x[:,1] = 500000

vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    if sum(x[ :, i])!=0:
        vector_one[ :, i + 1], vector_two[ :, i+ 1] = cp.random.poisson(.01*x[:,i],len(x[:,i])), cp.random.poisson(.01 * x[:,i],len(x[:,i]))
        x[ :, i+ 1] = x[ :, i] + vector_one[ :, i+ 1] - vector_two[ :, i+ 1]

t = time.time() - start  # named t so the time module is not overwritten
print(x)
print(t)

When I run this in cupy, the time comes out to about .62 seconds.

When I switch to numpy (uncommenting #import numpy as cp and #x = cp.zeros((200,200)), and commenting out import cupy as cp and x = (cp.zeros((200,200), 'f'))), the time comes out to about .11 seconds.

I thought that if I increased the array size, for example from (200,200) to (2000,2000), cupy would come out ahead, but it's still slower.

I know this is working properly, in a sense, because if I change the coefficient in cp.random.poisson from .01 to .5, I can only do that in cupy because that lambda is too large for numpy.

But still, how do I make it actually faster with cupy?

Upvotes: 2

Views: 3965

Answers (1)

Nick Becker

Reputation: 4214

In general, looping on the host (CPU) and processing small device (GPU) arrays one at a time isn't ideal, because it launches many more separate kernels than a column-oriented approach that operates on the whole array at once. However, sometimes a column-oriented approach just isn't feasible.

You can speed up your CuPy code by using CuPy's sum instead of Python's built-in sum, which forces a device-to-host transfer each time you call it. With that said, you can also speed up your NumPy code by switching to NumPy's sum.
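To see the difference in isolation, here is a minimal sketch that runs both reductions on a single column. The xp alias, the timed helper, and the fallback to NumPy when CuPy or a GPU isn't available are my own additions for illustration, not part of your code. Both calls return the same total, but the built-in sum iterates over the column element by element from Python, while .sum() is a single reduction call.

```python
import time

try:
    import cupy as xp
    xp.zeros(1).sum()  # probe: raises if there is no usable GPU
except Exception:
    import numpy as xp  # assumption: CPU fallback so the sketch still runs

# One column like the ones in the question: 200 float32 values of 500000.
column = xp.full(200, 500000, dtype="f")


def timed(fn):
    """Run fn, pull the result to the host as a float, and return (value, seconds)."""
    start = time.perf_counter()
    value = float(fn())  # float() forces the result onto the host
    return value, time.perf_counter() - start


builtin_total, builtin_secs = timed(lambda: sum(column))    # Python-level iteration
method_total, method_secs = timed(lambda: column.sum())     # single array reduction

print(builtin_total, method_total)
print(f"built-in sum: {builtin_secs:.6f}s   .sum(): {method_secs:.6f}s")
```

The absolute timings will depend on your hardware, and the first CuPy call also pays a one-time kernel compilation cost, so treat a single measurement as a rough indication only.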

import cupy as cp
import time
#import numpy as cp


def row_size(array):
    return(array.shape[1])

def number_of_rows(array):
    return(array.shape[0])

x = (cp.zeros((200,200), 'f'))
#x = cp.zeros((200,200))

x[:,1] = 500000

vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
#     if sum(x[ :, i]) !=0:
    if x[ :, i].sum() !=0: # or you could do: if x[ :, i].sum().get() !=0:
        vector_one[ :, i + 1], vector_two[ :, i+ 1] = cp.random.poisson(.01*x[:,i],len(x[:,i])), cp.random.poisson(.01 * x[:,i],len(x[:,i]))
        x[ :, i+ 1] = x[ :, i] + vector_one[ :, i+ 1] - vector_two[ :, i+ 1]

cp.cuda.Device().synchronize() # CuPy is asynchronous, but this doesn't really affect the timing here.

t = time.time() - start      
print(x)
print(t)

Output:

[[     0. 500000. 500101. ... 498121. 497922. 497740.]
 [     0. 500000. 499894. ... 502050. 502174. 502112.]
 [     0. 500000. 499989. ... 501703. 501836. 502081.]
 ...
 [     0. 500000. 499804. ... 499600. 499526. 499371.]
 [     0. 500000. 499923. ... 500371. 500184. 500247.]
 [     0. 500000. 500007. ... 501172. 501113. 501254.]]
0.06389498710632324

This small change should make your workflow much faster (0.06 vs 0.6 seconds originally on my T4 GPU). Note that the .get() method in the comment explicitly transfers the result of the sum from the GPU to the CPU before the not-equal comparison. This isn't necessary, as CuPy knows how to handle logical operations, but it would give you a very tiny additional speedup.
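If you want to compare the two libraries without commenting imports in and out, one option is to pass the array module into a function and synchronize the device before stopping the clock. The following is only a rough sketch: simulate, timed_run, and the xp parameter are hypothetical names I made up, and it drops the vector_one and vector_two arrays from your code and keeps only the two draws per column.

```python
import time

import numpy as np


def simulate(xp, n=200, rate=0.01, start_value=500000):
    """Column-by-column update from the question; xp is numpy or cupy."""
    x = xp.zeros((n, n), dtype="f")
    x[:, 1] = start_value
    for i in range(n - 1):
        col = x[:, i]
        if col.sum() != 0:  # array-level sum, not Python's built-in sum
            lam = rate * col
            added = xp.random.poisson(lam, len(col))
            removed = xp.random.poisson(lam, len(col))
            x[:, i + 1] = col + added - removed
    return x


def timed_run(xp, **kwargs):
    """Time simulate(), waiting for queued GPU work when xp is cupy."""
    start = time.perf_counter()
    result = simulate(xp, **kwargs)
    if xp.__name__ == "cupy":
        xp.cuda.Device().synchronize()
    return result, time.perf_counter() - start


np_result, np_seconds = timed_run(np)
print(f"numpy: {np_seconds:.3f}s")

try:
    import cupy as cp

    cp_result, cp_seconds = timed_run(cp)
    print(f"cupy:  {cp_seconds:.3f}s")
except Exception as exc:  # no CuPy install or no usable GPU
    print("CuPy run skipped:", exc)
```

Passing the module in keeps the loop body identical for both runs, so any timing difference comes from the library rather than from accidental edits between runs.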

Upvotes: 2
