user1192151


Using Shared & Constant Memory in CUDA

I want to read a text file and store it in an array, transfer the array from the host to the device, and keep it in shared memory. I have written the following code, but the execution time has increased compared with using global memory, and I cannot understand why. It would also be great if someone could help me write this code using constant memory.

__global__ void deviceFunction(char *pBuffer, int pSize){
    extern __shared__ char p[];
    int i;
    for(i = 0; i < pSize; i++){
        p[i] = pBuffer[i];
    }
}
int main(void){

    cudaMalloc((void**)&pBuffer_device, sizeof(char) * pSize);
    cudaMemcpy(pBuffer_device, pBuffer, sizeof(char) * pSize, cudaMemcpyHostToDevice);
    // extern __shared__ requires the dynamic shared-memory size as the
    // third launch parameter
    deviceFunction<<<BLOCK, THREAD, sizeof(char) * pSize>>>(pBuffer_device, pSize);

}

Upvotes: 0

Views: 1353

Answers (1)

djmj

Reputation: 5554

  1. Maybe because every thread in a block tries to write to the same shared memory addresses concurrently, over the whole range 0 to pSize!
    Use thread-collaborative loading of global memory data into shared memory: http://forums.nvidia.com/index.php?showtopic=216640&view=findpost&p=1332005
    Every thread in your kernel performs pSize global memory reads.
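A minimal sketch of what collaborative loading could look like for your kernel: each thread copies a strided subset of the buffer, so no two threads write the same shared memory address and each global element is read exactly once per block.

```cuda
// Sketch: collaborative loading of pBuffer into shared memory.
// Launch with the dynamic shared-memory size as the third parameter, e.g.
//   deviceFunction<<<BLOCK, THREAD, pSize * sizeof(char)>>>(pBuffer_device, pSize);
__global__ void deviceFunction(char *pBuffer, int pSize)
{
    extern __shared__ char p[];

    // Thread t copies elements t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int i = threadIdx.x; i < pSize; i += blockDim.x) {
        p[i] = pBuffer[i];
    }
    __syncthreads();  // after this barrier, all of p[] is valid for every thread

    // ... work on p[] here ...
}
```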

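Regarding the constant-memory part of the question, here is a hedged sketch. Constant memory must be a fixed-size, file-scope array (limited to 64 KB total) and is filled with cudaMemcpyToSymbol rather than cudaMemcpy; MAX_SIZE and constantKernel are assumed names for illustration.

```cuda
// Sketch: the same data served from constant memory instead of shared memory.
#define MAX_SIZE 1024          // assumed compile-time bound; must be <= 64 KB

__constant__ char cBuffer[MAX_SIZE];

__global__ void constantKernel(char *out, int pSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < pSize)
        out[i] = cBuffer[i];   // reads go through the constant cache
}

// Host side: copy into the symbol before launching the kernel, e.g.
//   cudaMemcpyToSymbol(cBuffer, pBuffer, pSize * sizeof(char));
```

Note that constant memory only pays off when all threads in a warp read the same address in the same instruction; for the per-thread indexed access above, shared or global memory may perform just as well.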
Upvotes: 1
