Purple Dawn
Purple Dawn

Reputation: 23

CUDA: shared data member of struct and member of reference to that struct have different addresses, values

Ok, heres the problem:
Using a CUDA 1.1 compute GPU, I am trying to maintain a set of (possibly varying number of, here fixed to 4) indices per thread, a reference to which I keep as a member of a struct variable.
My problem is that getting a reference to the struct then results in incorrect results when accessing the member array: I initialize the member array values with 0, when I read the array values using the original struct variable, I get the correct value (0), but when I read it using a reference to the struct var, I get garbage (-8193). This happens even if using a class instead of a struct.

Why does tmp go below/is not equal to 0?

C++ isn't my primary language, so this may be a conceptual issue, or it may be a quirk of working in CUDA.

struct DataIdx {
    int numFeats;
    int* featIdx;
};
extern __shared__ int sharedData[];

__global__  void myFn(){
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    
    DataIdx myIdx;  //instantiate the struct var in the context of the current thread
    myIdx.numFeats = 4;
    size_t idxArraySize = sizeof(int)*4;
    //get a reference to my array for this thread. Parallel Nsight debugger shows myIdx.featIdx address = 0x0000000000000000e0
    myIdx.featIdx = (int*)(&sharedData[tidx*idxArraySize]);  
    
    myIdx.featIdx[0] = 0x0;  //set first value to 0 
    int tmp = myIdx.featIdx[0];  // tmp is correctly eq to 0 in Nsight debugger -- As Expected!!
    tmp = 2*tmp;    antIdx.featIdx[0] = tmp; //ensure compiler doesn't elide out tmp
    
    DataIdx *tmpIdx = &myIdx;  //create a reference to my struct var
    tmp = tmpIdx.featIdx[0];   // expected 0, but tmp = -8193 in debugger !! why?  debugger shows address of tmpIdx.featIdx = __devicea__ address=8
    tmpIdx.featIdx[0] = 0x0;
    tmp = tmpIdx.featIdx[0]; // tmp = -1; cant even read what we just set
    
    //forcing the same reference as myIdx.featIdx, still gives a problem! debugger shows address of tmpIdx.featIdx = __devicea__ address=8
    tmpIdx->featIdx =  (int*)(&sharedData[tidx*idxArraySize]); 
    tmp = tmpIdx.featIdx[0]; //tmp = -8193!! why != 0?

    DataIdx tmpIdxAlias = myIdx;
    tmp = tmpIdx.featIdx[0]; //aliasing the original var gives correct results, tmp=0
    
    
     myIdx.featIdx[0] = 0x0;
     mySubfn(&myIdx); //this is a problem because it happens when passing the struct by reference to subfns
     mySubfn2(myIdx);
}
__device__ mySubfn(struct DataIdx *myIdx){
  int tmp = myIdx->featIdx[0]; //tmp == -8193!! should be 0
}
__device__ mySubfn2(struct DataIdx &myIdx){
  int tmp = myIdx.featIdx[0]; //tmp == -8193!! should be 0
}

Upvotes: 1

Views: 881

Answers (1)

Greg Smith
Greg Smith

Reputation: 11539

I had to modify your code to compile. In the line

tmpIdx->featIdx[0] = 0x0

the compiler is failing to understand the the pointer is to shared memory. Instead of doing a store to shared memory (R2G) it is doing a store to the global address 0x10 which is out of bounds.

    DataIdx *tmpIdx = &myIdx;
0x000024c8  MOV32 R2, R31;  
0x000024cc  MOV32 R2, R2;  
    tmp = tmpIdx->featIdx[0];
    tmpIdx->featIdx[0] = 0x0;
0x000024d0  MOV32 R3, R31;  
0x000024d4  MOV32 R2, R2;  
0x000024d8  IADD32I R4, R2, 0x4;  
0x000024e0  R2A A1, R4;  
0x000024e8  LLD.U32 R4, local [A1+0x0];  
0x000024f0  IADD R4, R4, R31;  
0x000024f8  SHL R4, R4, R31;  
0x00002500  IADD R4, R4, R31;  
0x00002508  GST.U32 global14 [R4], R3;   // <<== GLOBAL STORE vs. R2G (register to global register file)
    tmp = tmpIdx->featIdx[0];

The Nsight CUDA Memory Checker catches the out of bounds store to global memory.

Memory Checker detected 1 access violations.
error = access violation on store (global memory)
blockIdx = {0,0,0}
threadIdx = {0,0,0}
address = 0x00000010
accessSize = 0

If you compile for compute_10,sm_10 (actually <= 1.3) you should see the following warning for each line that the compiler cannot determine that the access is to shared memory:

kernel.cu(46): warning : Cannot tell what pointer points to, assuming global memory space

If you add a cudaDeviceSynchronize after the launch you should see the error code cudaErrorUnknown caused by the out of bounds memory access.

__shared__ is a variable memory qualifier not a type qualifier so I do know how you would tell the compiler that featIdx will always point to shared memory. On CC >= 2.0 the compiler should convert (int*)(&sharedData[tidx*idxArraySize]) to a generic pointer.

Upvotes: 1

Related Questions