Reputation: 41
I am playing around with the differences between numpy and cupy and have noticed that within these two similar programs I have created, the cupy version is much slower, despite the fact that it runs on a GPU.
Here is the numpy version:
import time
import numpy as np

size = 5000
upperBound = 20
dataSet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
dataLength = np.random.randint(0, high=upperBound, size=size, dtype='l')
randomNumber = np.random.randint(0, high=62, size=size * upperBound, dtype='l')
count = 0
dataCount = 0
start_time = time.time()
for i in range(size):
    lineData = ""
    for j in range(dataLength[i]):
        lineData = lineData + dataSet[randomNumber[count]]
        count = count + 1
    print(lineData)
    dataCount = dataCount + 1
elapsed = str(time.time() - start_time)  # avoid shadowing the time module
print("------------------------\n" + "It took this many seconds: " + elapsed)
print("There were " + str(dataCount) + " data generations.")
Here is the cupy version:
import time
import cupy as cp

size = 5000
upperBound = 20
dataSet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
dataLength = cp.random.randint(0, high=upperBound, size=size, dtype='l')
randomNumber = cp.random.randint(0, high=62, size=upperBound * size, dtype='l')
count = 0
dataCount = 0
start_time = time.time()
for i in range(size):
    lineData = ""
    for j in range(int(dataLength[i])):
        lineData = lineData + str(dataSet[int(randomNumber[count])])
        count = count + 1
    print(lineData)
    dataCount = dataCount + 1
elapsed = str(time.time() - start_time)  # avoid shadowing the time module
print("-------------------\n" + "It took this many seconds: " + elapsed)
print("There were " + str(dataCount) + " data generations.")
They are essentially the same code, except that one uses numpy and the other uses cupy. I was expecting cupy to execute faster due to the GPU usage, but that was not the case. The run time for numpy was 0.032 seconds, while the run time for cupy was 0.484 seconds.
Upvotes: 4
Views: 1982
Reputation: 4291
I cannot see a user-defined kernel in this code, so it is not using the GPU for any heavyweight matrix calculation. The delays from moving data to and from the GPU, plus the per-element type conversions, probably predominate.
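As a quick illustration of the transfer cost, here is a rough sketch built around the randomNumber array from the question: every int(randomNumber[count]) in the inner loop pulls a single element back from the GPU, one tiny synchronous copy per character, whereas copying the whole array to the host once removes almost all of that.

import time
import cupy as cp

size = 5000
upperBound = 20
randomNumber = cp.random.randint(0, high=62, size=size * upperBound, dtype='l')

# Per-element indexing: each int(...) forces a separate
# device-to-host copy and synchronization.
start = time.time()
total = 0
for i in range(size):
    total += int(randomNumber[i])
print("per-element transfers:", time.time() - start)

# Bulk transfer: move the whole array to the host once,
# then index ordinary numpy values in the loop.
start = time.time()
hostNumbers = cp.asnumpy(randomNumber)  # one bulk device-to-host copy
total = 0
for i in range(size):
    total += int(hostNumbers[i])
print("one bulk transfer:    ", time.time() - start)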
Upvotes: 0
Reputation: 94
This is a pitfall that catches many people new to GPUs: it is very common for a naive GPU version of a program to be slower than the CPU version. Making code go fast with a GPU is not trivial, mostly because of the extra latency of copying data to and from the GPU. Whatever speedup you get from using the GPU has to overcome this overhead first.
You are not doing nearly enough work on the GPU to make the overhead worth it. You're spending far more time in that cp.random.randint() call waiting for data to move than you are actually calculating anything. Do more work on the GPU and you will see it take charge, for example a reduction over a large data set, as sketched below.
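Here is a minimal sketch of the kind of workload I mean (the array size is arbitrary and the timings will vary with your hardware). The input is copied to the GPU once, in bulk, and the reduction itself runs entirely on the device:

import time
import numpy as np
import cupy as cp

n = 50_000_000  # a much bigger workload than the question's ~100k characters

a_cpu = np.random.random(n).astype(np.float32)
a_gpu = cp.asarray(a_cpu)  # one bulk host-to-device copy, paid once

# CPU reduction
start = time.time()
a_cpu.sum()
print("numpy sum:", time.time() - start)

# GPU reduction; synchronize so we time the kernel itself,
# not just the asynchronous launch
cp.cuda.Device().synchronize()
start = time.time()
a_gpu.sum()
cp.cuda.Device().synchronize()
print("cupy sum: ", time.time() - start)

On a data set this size the GPU usually pulls ahead, because the arithmetic finally outweighs the transfer and launch overhead.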
Numpy is much faster than you might expect because it is written in well-optimized C under the covers. It is not pure Python. So the benchmark you're trying to beat is actually quite fast.
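For a sense of scale, compare a pure-Python loop against the same reduction done inside numpy (a rough sketch; exact numbers depend on your machine):

import time
import numpy as np

a = np.random.random(1_000_000)

# Pure Python: one bytecode-interpreted iteration per element
start = time.time()
total = 0.0
for x in a:
    total += x
print("python loop:", time.time() - start)

# numpy: the same reduction in a single optimized C call
start = time.time()
total = a.sum()
print("numpy sum:  ", time.time() - start)

The C version is typically orders of magnitude faster, and that is the baseline cupy has to beat.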
If you really want to explore the depths of GPU performance tuning, try writing some CUDA and use the NVIDIA Visual Profiler to check out what the GPU is actually doing. Supposedly cupy has hooks for this, but I've never used them: https://docs-cupy.chainer.org/en/stable/reference/cuda.html#profiler
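Since I haven't used those hooks myself, treat this as a sketch based on the linked docs: run the script under the profiler with capture initially off (e.g. nvprof --profile-from-start off python script.py), and bracket the region of interest like so (the "reduction" label is just an arbitrary name):

import cupy as cp
from cupy.cuda import profiler, nvtx

a = cp.random.random(10_000_000)

profiler.start()             # begin capturing in the profiler
nvtx.RangePush("reduction")  # named range, shows up on the timeline
a.sum()
nvtx.RangePop()
profiler.stop()              # stop capturing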
Upvotes: 6