KuroNeko

Reputation: 331

Numbapro jit calculation gives incorrect result

I have a piece of code that uses Numbapro to write a simple kernel that squares the contents of two arrays of size 41724, adds them together, and stores the result in a third array. All the arrays have the same size and are float32. The code is below:

import numpy as np
from numba import *
from numbapro import cuda

@cuda.jit('void(float32[:],float32[:],float32[:])')
def square_add(a,b,c):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x

    i = tx + bx * bw

    #Since the length of a is 41724 and the total
    #threads is 41*1024 = 41984, this check is necessary
    if (i>len(a)):
            return
    else:
            c[i] = a[i]*a[i] + b[i]*b[i]


a = np.array(range(0,41724),dtype = np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c,copy=False)

#Launch the kernel; Gridsize = (1,41),Blocksize=(1,1024)
square_add[(1,41),(1,1024)](d_a,d_b,d_c)

c = d_c.copy_to_host()
print c
print len(c[0])

The values I am getting when I print the result of the operation (the array c) are completely different from what I get when I do the exact same computation in a Python terminal. I do not know what I am doing wrong here.
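For comparison, the result I expect is what plain NumPy gives for the same element-wise square-and-add (a CPU-side reference, independent of the CUDA kernel):

```python
import numpy as np

# Reference computation on the CPU: c[i] = a[i]^2 + b[i]^2
a = np.arange(0, 41724, dtype=np.float32)
b = np.arange(41724, 83448, dtype=np.float32)
expected = a * a + b * b

print(expected[:3])   # first few expected values
print(expected.shape)
```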

Upvotes: 0

Views: 365

Answers (1)

talonmies

Reputation: 72342

There are two problems here.

The first is that you are specifying a block and grid dimension for your CUDA kernel launch which is incompatible with the indexing scheme you have chosen to use in the kernel.

This:

square_add[(1,41),(1,1024)](d_a,d_b,d_c)

launches a two dimensional grid where all the threads have the same block and thread dimensions in x, and vary only in y. This implies that

tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x

i = tx + bx * bw

will yield i=0 for every thread. If you change the kernel launch to this:

square_add[(41,1),(1024,1)](d_a,d_b,d_c)

you will find that the indexing works correctly.
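To see why, here is a host-side sketch (plain Python, not CUDA) that enumerates the x-dimension indices the kernel's `i = tx + bx * bw` expression can produce for a given launch configuration. With the original launch the x dimensions are all 1, so every thread computes `i = 0`:

```python
def flat_x_indices(grid, block):
    """Enumerate i = threadIdx.x + blockIdx.x * blockDim.x for a launch.

    Only the x dimensions matter here, since that is all the
    kernel's indexing expression uses; threads launched in y
    all recompute the same x indices.
    """
    grid_x, _ = grid
    block_x, _ = block
    return {tx + bx * block_x for bx in range(grid_x) for tx in range(block_x)}

# Original launch: (1, 41) blocks of (1, 1024) threads -> 1 distinct index.
print(len(flat_x_indices((1, 41), (1, 1024))))    # 1
# Corrected launch: (41, 1) blocks of (1024, 1) threads -> 41984 distinct indices.
print(len(flat_x_indices((41, 1), (1024, 1))))    # 41984
```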

The second is that c has been declared as a two dimensional array, but the kernel function signature was declared as taking a one dimensional array. Arguably, the numbapro runtime should detect this mismatch and raise an error.
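As an aside (a host-side sketch, not tested against numbapro itself), the shape mismatch could also be resolved the other way around: allocate c as a one dimensional array so that it matches the original `float32[:]` signature, and reshape afterwards if a `(1, 41724)` result is wanted:

```python
import numpy as np

# 1-D output buffer, matching the original 'float32[:]' kernel signature
c = np.zeros(41724, dtype=np.float32)
print(c.shape)            # (41724,)

# Reshape later if a (1, 41724) view is needed
c2d = c.reshape(1, -1)
print(c2d.shape)          # (1, 41724)
```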

I was able to get your example to work correctly like this:

import numpy as np
from numba import *
from numbapro import cuda

@cuda.jit('void(float32[:],float32[:],float32[:,:])')
def square_add(a,b,c):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x

    i = tx + bx * bw

    if (i<len(a)):
        c[0,i] = a[i]*a[i] + b[i]*b[i]

a = np.array(range(0,41724),dtype=np.float32)
b = np.array(range(41724,83448),dtype=np.float32)
c = np.zeros(shape=(1,41724),dtype=np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c, copy=False)

square_add[(41,1),(1024,1)](d_a,d_b,d_c)

c = d_c.copy_to_host()
print(c)
print(c.shape)

[Note: I am using Python 3, so this uses print functions rather than Python 2 print statements.]

$ ipython numbatest.py 
numbapro:1: ImportWarning: The numbapro package is deprecated in favour of the accelerate package. Please update your code to use equivalent functions from accelerate.
[[  1.74089216e+09   1.74097562e+09   1.74105907e+09 ...,   8.70371021e+09
    8.70396006e+09   8.70421094e+09]]
(1, 41724)

Upvotes: 1
