Reputation: 1212
I am using a Tesla C1060 (compute capability 1.3) with nvcc compiler driver 4.0. I am trying to do some computation local to each thread block. Each thread block has a shared array that is first initialized to zero. To synchronize concurrent updates (additions) to the shared data by the threads of the block, I use the CUDA atomicAdd primitive.
Once each thread block has its results ready in its shared data array, each entry of the shared array is merged (using atomicAdd) into the corresponding entry of a global data array.
Below is code very similar to what I am trying to do.
#define DATA_SZ 16
typedef unsigned long long int ULLInt;
__global__ void kernel( ULLInt* data, ULLInt ThreadCount )
{
    ULLInt thid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ ULLInt sharedData[DATA_SZ];
    // Initialize the shared data
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
    }
    __syncthreads();
    //..some code here
    if( thid < ThreadCount )
    {
        //..some code here
        // getIndex() returns a value in [0, DATA_SZ )
        atomicAdd( &sharedData[getIndex(thid)], thid );
        //..some code here
        for(..a loop...)
        {
            //..some code here
            if(thid % 2 == 0)
            {
                atomicAdd( &sharedData[getIndex(thid)], thid * thid );
            }
        }
    }
    __syncthreads();
    // Merge this block's shared results into the global array
    if( threadIdx.x == 0 )
    {
        // ...
        for( int i = 0; i < DATA_SZ; i++ ) { atomicAdd( &data[i], sharedData[i] ); }
        //...
    }
}
If I compile with -arch=sm_20 I don't get any errors. However, when I compile the kernel with the -arch=sm_13 option I get the following errors:
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas fatal : Ptx assembly aborted due to errors
If I comment out the following two lines, I don't get any errors with -arch=sm_13:
atomicAdd( &sharedData[getIndex(thid)], thid );
atomicAdd( &sharedData[getIndex(thid)], thid * thid );
Can someone suggest what I might be doing wrong?
Upvotes: 1
Views: 1574
Reputation: 1212
Found the solution in the CUDA C Programming Guide: Atomic functions operating on shared memory and atomic functions operating on 64-bit words are only available for devices of compute capability 1.2 and above. Atomic functions operating on 64-bit words in shared memory are only available for devices of compute capability 2.x and higher.
So basically I cannot use ULLInt for the shared memory array on this device (compute capability 1.3); I need to use unsigned int for the shared array instead.
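For illustration, here is a minimal sketch of that change, not the original kernel. It assumes that each block's per-entry sum fits in 32 bits (otherwise the shared counters would wrap), and it uses a hypothetical getIndex() (thid modulo DATA_SZ) and a hypothetical host helper runKernel() that are not from the question. The shared array becomes unsigned int, so the shared-memory atomicAdd is a 32-bit operation, while the global array stays ULLInt, since 64-bit atomics on global memory are available from compute capability 1.2.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DATA_SZ 16
typedef unsigned long long int ULLInt;

// Hypothetical stand-in for the question's getIndex(); returns a value in [0, DATA_SZ).
__device__ unsigned int getIndex( ULLInt thid )
{
    return (unsigned int)( thid % DATA_SZ );
}

__global__ void kernel( ULLInt* data, ULLInt threadCount )
{
    ULLInt thid = threadIdx.x + (ULLInt)blockIdx.x * blockDim.x;

    // 32-bit shared counters: 64-bit atomics on shared memory need compute capability 2.x.
    __shared__ unsigned int sharedData[DATA_SZ];
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
    }
    __syncthreads();

    if( thid < threadCount )
    {
        // Assumption: the per-block sum for each entry stays below 2^32.
        atomicAdd( &sharedData[getIndex( thid )], (unsigned int)thid );
    }
    __syncthreads();

    // Widen to 64 bits only when merging into global memory.
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { atomicAdd( &data[i], (ULLInt)sharedData[i] ); }
    }
}

// Hypothetical host helper: runs the kernel and copies the DATA_SZ results to hostOut.
bool runKernel( ULLInt threadCount, ULLInt* hostOut )
{
    ULLInt* devData = 0;
    const size_t bytes = DATA_SZ * sizeof( ULLInt );
    if( cudaMalloc( (void**)&devData, bytes ) != cudaSuccess ) { return false; }
    cudaMemset( devData, 0, bytes );

    const unsigned int threadsPerBlock = 64;
    unsigned int blocks = (unsigned int)( ( threadCount + threadsPerBlock - 1 ) / threadsPerBlock );
    kernel<<< blocks, threadsPerBlock >>>( devData, threadCount );

    // cudaMemcpy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy( hostOut, devData, bytes, cudaMemcpyDeviceToHost );
    cudaFree( devData );
    return err == cudaSuccess;
}

int main()
{
    ULLInt result[DATA_SZ];
    if( !runKernel( 128, result ) ) { printf( "CUDA error\n" ); return 1; }
    for( int i = 0; i < DATA_SZ; i++ ) { printf( "data[%d] = %llu\n", i, result[i] ); }
    return 0;
}
```

This should build with nvcc -arch=sm_13. If a per-block sum could exceed 32 bits (for example when accumulating thid * thid over many iterations), one option is to flush the shared counters to the global array more often, or to accumulate directly into the 64-bit global array at the cost of more global atomics.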
Upvotes: 1