Reputation: 13
I'm working through Nvidia's "Fundamentals of Accelerated Computing with CUDA Python" course and have been given a task to refactor a simple version of some code that performs the work needed to create a hidden layer in a neural network:
import numpy as np
from numba import cuda, vectorize

n = 1000000

greyscales = np.floor(np.random.uniform(0, 255, n).astype(np.float32))
weights = np.random.normal(.5, .1, n).astype(np.float32)

from numpy import exp

def normalize(grayscales):
    return grayscales / 255

def weigh(values, weights):
    return values * weights

def activate(values):
    return (exp(values) - exp(-values)) / (exp(values) + exp(-values))

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

arguments = {"n": n,
             "greyscales": greyscales,
             "weights": weights,
             "exp": exp,
             "normalize": normalize,
             "weigh": weigh,
             "activate": activate}

a = create_hidden_layer(**arguments)
print(a)
I have transformed the code a little; after my modifications it looks like this:
from math import exp

@vectorize(['float32(float32)'], target='cuda')
def normalize(grayscales):
    return grayscales / 255

@vectorize(['float32(float32,float32)'], target='cuda')
def weigh(values, weights):
    return values * weights

@vectorize(['float32(float32)'], target='cuda')
def activate(values):
    return (exp(values) - exp(-values)) / (exp(values) + exp(-values))

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)

normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)

activated = activated.copy_to_host()

arguments = {"n": n,
             "greyscales": greyscales,
             "weights": weights,
             "exp": exp,
             "normalize": normalize,
             "weigh": weigh,
             "activate": activate}

a = create_hidden_layer(**arguments)
print(a)
The code seems to work fine after all the transformations, but there is one problem: it's not fast enough. The task states that the code should run in less than 1 s, while mine runs in 1.23 s.
Does anyone know how I could refactor my code further, or notice any silly mistakes I have made? I would be very grateful for any help!
Upvotes: 1
Views: 627
Reputation: 36
greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)
normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)
activated = activated.copy_to_host()
Move this section inside the create_hidden_layer function. I did that and it ran in ~0.5 s.
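For illustration, here is a minimal sketch of that refactor, assuming the @vectorize definitions and the n, greyscales and weights arrays from the question are in scope. Numba's CUDA ufuncs accept an out= keyword, so the pre-allocated device arrays can actually receive the results:

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    # copy the inputs to the GPU inside the function
    greyscales = cuda.to_device(greyscales)
    weights = cuda.to_device(weights)
    # pre-allocate device arrays so each ufunc writes straight to GPU memory
    normalized = cuda.device_array(shape=(n,), dtype=np.float32)
    weighted = cuda.device_array(shape=(n,), dtype=np.float32)
    activated = cuda.device_array(shape=(n,), dtype=np.float32)
    # run the three CUDA ufuncs; the intermediates never leave the device
    normalize(greyscales, out=normalized)
    weigh(normalized, weights, out=weighted)
    activate(weighted, out=activated)
    # copy only the final result back to the host
    return activated.copy_to_host()

Keeping the intermediates on the device avoids the implicit host/device round trips that occur when CUDA ufuncs are called on NumPy arrays.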
Upvotes: 1
Reputation: 180
Here are some things that you could try to speed up your code (a fused-kernel sketch follows the list):
- Use @cuda.jit to compile your own kernel instead of three separate @vectorize ufuncs.
- Use cuda.grid(1) to get each thread's absolute index; it is computed from cuda.threadIdx.x, cuda.blockIdx.x and cuda.blockDim.x (the number of threads in a block), and you can use it to index the 1D arrays directly.
- Use cuda.shared.array() to create a block-local shared memory array inside the kernel, and cuda.syncthreads() as a barrier so that every thread in the block has finished writing to shared memory before any thread reads from it.
- Use cuda.to_device() to copy input data to the GPU and copy_to_host() to copy results back, so that transfers happen only where you intend them to.
- Use cuda.device_array_like() to allocate an output array on the GPU with the same shape and dtype as an existing array.
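For example, here is a minimal sketch of a single @cuda.jit kernel that fuses the three steps into one pass, assuming the n, greyscales and weights arrays from the question; the kernel name and launch configuration are illustrative, not from the original code:

from math import exp
from numba import cuda

@cuda.jit
def hidden_layer_kernel(greyscales, weights, out):
    i = cuda.grid(1)  # absolute 1D thread index across the whole grid
    if i < out.size:  # guard threads that fall past the end of the array
        v = (greyscales[i] / 255) * weights[i]  # normalize, then weigh
        out[i] = (exp(v) - exp(-v)) / (exp(v) + exp(-v))  # tanh activation

d_greyscales = cuda.to_device(greyscales)     # host -> device
d_weights = cuda.to_device(weights)
d_out = cuda.device_array_like(d_greyscales)  # uninitialized GPU output

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
hidden_layer_kernel[blocks, threads_per_block](d_greyscales, d_weights, d_out)
a = d_out.copy_to_host()                      # device -> host

One fused kernel replaces the three separate ufunc launches and intermediate arrays of the @vectorize version.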
Upvotes: 0