Reputation: 195
I'm trying to parallelize the following operation with cupy: I have an array. For each column of that array, I'm generating 2 random vectors. I take that array column, add one of the vectors, subtract the other, and make that new vector the next column of the array. I continue on until I finish with the array.
I already asked a related question - Cupy slower than numpy when iterating through array. But this one is different: I believe I followed the advice there by parallelizing the operation, using one "for loop" instead of two, and iterating only over the array's columns instead of over both rows and columns.
import cupy as cp
import time
#import numpy as cp

def row_size(array):
    return array.shape[1]

def number_of_rows(array):
    return array.shape[0]

x = cp.zeros((200, 200), 'f')
#x = cp.zeros((200,200))
x[:, 1] = 500000
vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    if sum(x[:, i]) != 0:
        vector_one[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        vector_two[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        x[:, i + 1] = x[:, i] + vector_one[:, i + 1] - vector_two[:, i + 1]

t = time.time() - start
print(x)
print(t)
When I run this in cupy, the time comes out to about .62 seconds.
When I switch to numpy - that is, I uncomment #import numpy as cp and #x = cp.zeros((200,200)), and comment out import cupy as cp and x = cp.zeros((200, 200), 'f') - the time comes out to about .11 seconds.
I thought maybe if I increase the array size, for example from (200,200) to (2000,2000), then I'd see a difference in cupy being faster, but it's still slower.
I know this is working properly, in a sense, because if I change the coefficient in cp.random.poisson from .01 to .5, I can only do that in cupy because that lambda is too large for numpy.
But still, how do I make it actually faster with cupy?
Upvotes: 2
Views: 3965
Reputation: 4214
In general, looping on the host (CPU) and iteratively processing small device (GPU) arrays isn't ideal: it forces you to launch many more separate kernels than a columnar (whole-array) approach would. However, sometimes a columnar approach just isn't feasible.
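To make the kernel-launch point concrete, here is a small sketch. NumPy is used so it runs on any machine; under CuPy (swapping in `import cupy as np`), each call inside the loop would be a separate kernel launch, while the single `axis=0` reduction would be one:

```python
import numpy as np  # stand-in; with CuPy, each call below runs on the GPU

x = np.random.rand(200, 200).astype(np.float32)

# Column-at-a-time: 200 separate reduction calls
# (200 kernel launches under CuPy).
col_sums_loop = np.stack([x[:, i].sum() for i in range(x.shape[1])])

# Single call: one reduction over axis 0 (one kernel launch under CuPy).
col_sums_vec = x.sum(axis=0)

assert np.allclose(col_sums_loop, col_sums_vec)
```

Both produce the same 200 column sums; only the number of kernel launches (and hence the launch overhead) differs on the GPU.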
You can speed up your CuPy code by using CuPy's sum (the ndarray .sum() method) instead of Python's built-in sum, which forces device-to-host transfers every time you call it. With that said, you can also speed up your NumPy code by switching to NumPy's sum.
import cupy as cp
import time
#import numpy as cp

def row_size(array):
    return array.shape[1]

def number_of_rows(array):
    return array.shape[0]

x = cp.zeros((200, 200), 'f')
#x = cp.zeros((200,200))
x[:, 1] = 500000
vector_one = x * 0
vector_two = x * 0

start = time.time()
for i in range(number_of_rows(x) - 1):
    # if sum(x[:, i]) != 0:
    if x[:, i].sum() != 0:  # or: if x[:, i].sum().get() != 0:
        vector_one[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        vector_two[:, i + 1] = cp.random.poisson(.01 * x[:, i], len(x[:, i]))
        x[:, i + 1] = x[:, i] + vector_one[:, i + 1] - vector_two[:, i + 1]

cp.cuda.Device().synchronize()  # CuPy is asynchronous, but this doesn't really affect the timing here
t = time.time() - start
print(x)
print(t)
[[ 0. 500000. 500101. ... 498121. 497922. 497740.]
[ 0. 500000. 499894. ... 502050. 502174. 502112.]
[ 0. 500000. 499989. ... 501703. 501836. 502081.]
...
[ 0. 500000. 499804. ... 499600. 499526. 499371.]
[ 0. 500000. 499923. ... 500371. 500184. 500247.]
[ 0. 500000. 500007. ... 501172. 501113. 501254.]]
0.06389498710632324
This small change should make your workflow much faster (0.06 s vs. 0.6 s originally on my T4 GPU). Note that the .get() method in the comment explicitly transfers the result of the sum reduction from the GPU to the CPU before the not-equal comparison. This isn't necessary, as CuPy knows how to handle logical operations, but it would give you a very tiny additional speedup.
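Here is a minimal sketch of that explicit-transfer pattern. It falls back to NumPy when CuPy isn't importable so it runs anywhere; the `on_gpu` flag exists only for this illustration:

```python
try:
    import cupy as xp  # GPU path
    on_gpu = True
except ImportError:
    import numpy as xp  # CPU fallback; same array API for this snippet
    on_gpu = False

x = xp.arange(10, dtype=xp.float32)
total = x.sum()                           # 0-d array; stays on the device under CuPy
host = total.get() if on_gpu else total   # .get() copies the result to the host
# A direct comparison like `total != 0` also works: CuPy evaluates it on the
# device and coerces the 0-d boolean result when `if` needs a Python bool.
nonzero = bool(total != 0)
print(float(host), nonzero)  # 45.0 True
```

Either style is correct; `.get()` just makes the device-to-host copy explicit rather than leaving the coercion to CuPy.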
Upvotes: 2