horus
horus

Reputation: 89

dot product without for loop with C Cuda

I'm trying to write the C CUDA code to implement the dot product without a for loop inside the kernel. The following code performs the tiling of the input vectors, which are respectively filled with 10 and 15, into the corresponding shared float arrays s_in1 and s_in2; the result of the multiplication between each element of these arrays is stored into a shared float array block. The result is correct (4'800'000) for an input array size of 32000 (inputLength=32000), but with a size of 320000 (inputLength=320000) the result is wrong (48'192'608 instead of 48'000'000). Why? Even if I rewrite the code using a float variable block instead of the shared array, it exhibits the same problem. The result is always the same each time I execute the code. Thanks in advance for your help!

I compile the code on a Jetson TX1 - CUDA 7.0 with:

nvcc mycode.cu -o mycode

and this is the complete code:

#define THREADS_PER_BLOCK 1000

// Dot-product kernel: each thread multiplies one element pair and
// atomically accumulates the product into the single accumulator *out.
// NOTE(review): *out is never initialized on the device before the
// atomicAdd calls, so the sum starts from an undefined value (issue 1
// in the accepted answer).
// NOTE(review): summing 320000 products of 150 into a single float
// exceeds float's 24-bit integer precision, so the large-input total is
// rounded (issue 2 in the accepted answer).
__global__ void scalar_prod(float *in1, float *in2, float *out) 
{

// Shared-memory staging buffers; no data is actually exchanged between
// threads here, so these give no benefit over plain registers.
__shared__ float block[THREADS_PER_BLOCK];
__shared__ float s_in1[THREADS_PER_BLOCK];
__shared__ float s_in2[THREADS_PER_BLOCK];

// Flat 1-D global index. There is no bounds check, so the grid must
// exactly cover the input (inputLength divisible by THREADS_PER_BLOCK).
unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
s_in1[threadIdx.x]=in1[xIndex];
s_in2[threadIdx.x]=in2[xIndex];

block[threadIdx.x] =  s_in1[threadIdx.x] * s_in2[threadIdx.x];
// This barrier is not needed for correctness: each thread only reads
// back the shared element it wrote itself.
__syncthreads();
atomicAdd(out, block[threadIdx.x]);
}

// Host driver: fills two length-inputLength vectors with 10 and 15,
// runs scalar_prod on the GPU, and prints the dot product
// (expected 10 * 15 * inputLength).
int main()
{

int inputLength=320000;
float *hostInput1;
float *hostInput2;
float  hostOutput=0;
float *deviceInput1;
float *deviceInput2;
float *deviceOutput;
unsigned int i;

hostInput1=(float*) malloc(inputLength*sizeof(float));
hostInput2=(float*) malloc(inputLength*sizeof(float));

for(i=0;i<inputLength;++i)
{
  hostInput1[i]=10;
  hostInput2[i]=15;
}

cudaMalloc((void **)&deviceInput1, inputLength * sizeof(float));
cudaMalloc((void **)&deviceInput2, inputLength * sizeof(float));
// NOTE(review): deviceOutput is allocated but never zeroed before the
// kernel's atomicAdd calls -- the accumulation starts from garbage
// (issue 1 in the accepted answer).
cudaMalloc((void **)&deviceOutput, sizeof(float));

cudaMemcpy(deviceInput1, hostInput1, inputLength * 
sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(deviceInput2, hostInput2, inputLength * 
sizeof(float),cudaMemcpyHostToDevice);

dim3 blockDim(THREADS_PER_BLOCK);
// NOTE(review): inputLength/THREADS_PER_BLOCK is integer division, so
// it truncates BEFORE ceil() runs; the tail block is silently dropped
// whenever inputLength is not a multiple of THREADS_PER_BLOCK.
dim3 gridDim(ceil(inputLength/THREADS_PER_BLOCK));

scalar_prod<<<gridDim, blockDim>>>(deviceInput1, deviceInput2, deviceOutput);

// No error checking: launch-configuration or execution errors from the
// kernel would go unnoticed here.
cudaDeviceSynchronize();

cudaMemcpy(&hostOutput, deviceOutput,sizeof(float), cudaMemcpyDeviceToHost);

printf("\n result:%f \n",hostOutput);

cudaFree(deviceInput1);
cudaFree(deviceInput2);
cudaFree(deviceOutput);
free(hostInput1);
free(hostInput2); 
return 0;     
}

Upvotes: 0

Views: 154

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 151799

There are at least 2 issues with the code:

  1. You are not initializing the storage pointed to by deviceOutput before you start doing atomicAdd operations on it. So the initial value is undefined.

  2. You are exceeding the capability of float arithmetic.

The fix for item 1 is trivial - we can easily initialize this to zero before running the kernel. For item 2, a straightforward "fix" would be to switch everything from float to double. However, on your Jetson GPU, we don't have a convenient atomicAdd intrinsic for double values, but the programming guide gives us a possible implementation using atomicCAS. If we combine these, we can end up with a code that works:

$ cat t122.cu
#include <stdio.h>
#define THREADS_PER_BLOCK 1000

#ifdef USE_DOUBLE
typedef double mytype;
#else
typedef float mytype;
#endif

// Software atomicAdd for double, built on 64-bit atomicCAS -- the
// implementation given in the CUDA C Programming Guide for devices
// without a native double atomicAdd (pre-SM60, e.g. the Jetson TX1).
// Retries the compare-and-swap until no other thread has modified
// *address between the read and the swap. Returns the old value, like
// the built-in atomicAdd.
__device__ double my_atomicAdd(double* address, double val) {
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed))); // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
  return __longlong_as_double(old);
}
// float overload: defers to the hardware atomicAdd, so the kernel can
// call my_atomicAdd() uniformly whichever way mytype is typedef'd.
__device__ float my_atomicAdd(float *addr, float val){
  return atomicAdd(addr, val);
}

// Dot-product kernel: each thread multiplies one element pair and
// atomically accumulates the product into *out.
//
// Launch layout: 1-D grid, blockDim.x == THREADS_PER_BLOCK.
// Preconditions: *out must be zero-initialized by the host before
// launch, and the grid must exactly cover the input (no length
// parameter is passed, so there is no bounds guard -- the caller must
// keep inputLength a multiple of THREADS_PER_BLOCK).
//
// The shared-memory staging of the original gave no benefit -- no data
// is exchanged between threads and nothing is reused -- so each thread
// now works straight from registers, and the (unneeded) __syncthreads()
// is gone with it.
__global__ void scalar_prod(const mytype * __restrict__ in1,
                            const mytype * __restrict__ in2,
                            mytype *out)
{
  unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  my_atomicAdd(out, in1[xIndex] * in2[xIndex]);
}

// Host driver: fills two vectors with 10 and 15, computes their dot
// product on the GPU with scalar_prod, and prints the result.
// Expected output for inputLength = 320000: 10 * 15 * 320000 = 48,000,000.
// Returns 0 on success, 1 on any CUDA error.
int main()
{

int inputLength=320000;
mytype *hostInput1;
mytype *hostInput2;
mytype  hostOutput=0;
mytype *deviceInput1;
mytype *deviceInput2;
mytype *deviceOutput;
unsigned int i;

hostInput1=(mytype*) malloc(inputLength*sizeof(mytype));
hostInput2=(mytype*) malloc(inputLength*sizeof(mytype));

for(i=0;i<inputLength;++i)
{
  hostInput1[i]=10;
  hostInput2[i]=15;
}

cudaMalloc((void **)&deviceInput1, inputLength * sizeof(mytype));
cudaMalloc((void **)&deviceInput2, inputLength * sizeof(mytype));
cudaMalloc((void **)&deviceOutput, sizeof(mytype));

cudaMemcpy(deviceInput1, hostInput1, inputLength *
sizeof(mytype),cudaMemcpyHostToDevice);
cudaMemcpy(deviceInput2, hostInput2, inputLength *
sizeof(mytype),cudaMemcpyHostToDevice);

// Zero the accumulator before the kernel's atomic adds (issue 1).
cudaMemcpy(deviceOutput, &hostOutput,
sizeof(mytype),cudaMemcpyHostToDevice);

dim3 blockDim(THREADS_PER_BLOCK);
// Integer ceiling division. The original ceil(inputLength/THREADS_PER_BLOCK)
// truncated BEFORE ceil() ran (integer division), silently dropping the
// tail block for sizes that are not a multiple of THREADS_PER_BLOCK.
// (The kernel itself still assumes the grid exactly covers the input.)
dim3 gridDim((inputLength + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);

scalar_prod<<<gridDim, blockDim>>>(deviceInput1, deviceInput2, deviceOutput);

// Launch-configuration errors surface via cudaGetLastError(); in-kernel
// execution errors surface at the synchronizing call.
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess) err = cudaDeviceSynchronize();
if (err != cudaSuccess)
{
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
  return 1;
}

cudaMemcpy(&hostOutput, deviceOutput,sizeof(mytype), cudaMemcpyDeviceToHost);

printf("\n result:%f \n",hostOutput);

cudaFree(deviceInput1);
cudaFree(deviceInput2);
cudaFree(deviceOutput);
free(hostInput1);
free(hostInput2);
return 0;
}
$ nvcc -arch=sm_30 -o t122 t122.cu -DUSE_DOUBLE
$ ./t122

 result:48000000.000000
$

Note that there are various other items about this code that don't make much sense - for example, the usage of __shared__ memory here is not giving you any benefit, since there is no actual sharing of data between threads, and no input data reuse in vector dot product. But these didn't seem to be the focus of your question and don't make the code incorrect.

Upvotes: 2

Related Questions