kndrtt

Reputation: 13

GPU-accelerate neural network calculations

I'm working through Nvidia's "Fundamentals of Accelerated Computing with CUDA Python" course and got a task to refactor a simple version of some code that performs the work needed to create a hidden layer in a neural network:

import numpy as np
from numba import cuda, vectorize

n = 1000000

greyscales = np.floor(np.random.uniform(0, 255, n).astype(np.float32))
weights = np.random.normal(.5, .1, n).astype(np.float32)

from numpy import exp

def normalize(grayscales):
    return grayscales / 255

def weigh(values, weights):
    return values * weights
    
def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

arguments = {"n":n,
            "greyscales": greyscales,
            "weights": weights,
            "exp": exp,
            "normalize": normalize,
            "weigh": weigh,
            "activate": activate}

a = create_hidden_layer(**arguments)
print(a)

I have transformed the code a bit; after my modifications it looks like this:

from math import exp

@vectorize(['float32(float32)'],target='cuda')
def normalize(grayscales):
    return grayscales / 255

@vectorize(['float32(float32,float32)'],target='cuda')
def weigh(values, weights):
    return values * weights

@vectorize(['float32(float32)'],target='cuda')
def activate(values):
    return ( exp(values) - exp(-values) ) / ( exp(values) + exp(-values) )

def create_hidden_layer(n, greyscales, weights, exp, normalize, weigh, activate):
    normalized = normalize(greyscales)
    weighted = weigh(normalized, weights)
    activated = activate(weighted)
    return activated

greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)

normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)

activated = activated.copy_to_host()

arguments = {"n":n,
            "greyscales": greyscales,
            "weights": weights,
            "exp": exp,
            "normalize": normalize,
            "weigh": weigh,
            "activate": activate}

a = create_hidden_layer(**arguments)
print(a)

The code seems to work fine after all the transformations, but there is one problem: it's not fast enough. The task states that the code should run in less than 1 s, while my code runs in 1.23 s.

Does anyone know how I could refactor my code further, or spot any silly mistakes I have made? I would be very grateful for any help!

Upvotes: 1

Views: 627

Answers (2)

Ani

Reputation: 36

greyscales = cuda.to_device(greyscales)
weights = cuda.to_device(weights)

normalized = cuda.device_array(shape=(n,), dtype=np.float32)
weighted = cuda.device_array(shape=(n,), dtype=np.float32)
activated = cuda.device_array(shape=(n,), dtype=np.float32)

activated = activated.copy_to_host()

Move this section inside the "create_hidden_layer" function. I did that and it ran in ~0.5 secs.

Upvotes: 1

James

Reputation: 180

Here are some things that you could try to speed up your code:

  1. Use @cuda.jit to compile your own kernel instead of relying only on @vectorize.
  2. In the kernel, use cuda.grid(1) to get each thread's global index into the 1D arrays (equivalently, compute cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x by hand).
  3. If threads in a block need to share intermediate results, stage them in a shared-memory array created inside the kernel with cuda.shared.array(), and call cuda.syncthreads() so every thread has written before any thread reads.
  4. After launching the kernel, call cuda.synchronize() on the host to wait for it to finish, then copy the result back with the device array's .copy_to_host() method.
  5. Use cuda.to_device() to copy input data from the CPU to the GPU up front.
  6. You can also use cuda.device_array_like() to create an uninitialized array on the GPU with the same shape and dtype as an existing array.

Upvotes: 0
