Reputation: 3009
I started my adventure with CUDA today. I'm trying to share a single unsigned int among all the threads, each of which modifies it. I copied the value to the device using cudaMemcpy, but when the calculations finished the value came back as 0.
Maybe several threads are writing to this variable at the same time? I'm not sure whether I should use semaphores, lock the variable while a thread is writing, or something else.
EDIT:
It's hard to give more detail because my question is really about how to solve this in general. I'm not implementing a real algorithm, only testing CUDA.
But if you wish... I created a vector of unsigned int values. I'm trying to do something like searching for values bigger than a given shared value: when a value from the vector is bigger, I add 1 to it and save the result as the shared value.
It looks like this:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)   // check idx < N first to avoid an out-of-bounds read
        *b = a[idx] + 1;
}
As I said, it's not useful code, only a test, but I wonder how to do it...
Upvotes: 1
Views: 2114
Reputation: 96167
edit - deleted error
Ideally, though, you don't want to do this at all - unless you can be sure all the threads will take about the same time. See the CUDA thread tutorial.
Upvotes: -1
Reputation: 2053
"My question is in general how to use shared memory global for every threads."
To read the value you don't need anything special. What you did works; it is faster on Fermi devices because they have a cache, slower on older ones.
If you read the value after other threads have changed it, you have no way to wait for all threads to finish their writes before reading, so the result might not be what you expect.
The only way to synchronize a value in global memory between all running threads is to use separate kernels: after you change a value you want to share between all threads, let the kernel finish and launch a new one that works with the updated value.
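A minimal host-side sketch of that two-kernel pattern (the kernel names, grid sizes, and bodies here are made up for illustration; error checking is omitted):

```cuda
// Hypothetical kernels: the first updates a value in global memory,
// the second reads it after the first has fully finished.
__global__ void updateValue(unsigned int *shared_val) { /* ... write *shared_val ... */ }
__global__ void useValue(unsigned int *shared_val)    { /* ... every thread reads it ... */ }

int main() {
    unsigned int *d_val;
    cudaMalloc(&d_val, sizeof(unsigned int));
    cudaMemset(d_val, 0, sizeof(unsigned int));

    updateValue<<<64, 256>>>(d_val);  // all writes to *d_val happen here
    // Launches on the same stream execute in order, so the second kernel
    // only starts after the first has completed all its writes.
    useValue<<<64, 256>>>(d_val);     // every thread now sees the settled value

    cudaDeviceSynchronize();          // wait before the host touches the result
    cudaFree(d_val);
    return 0;
}
```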
To make every thread write to the same memory location you must use atomic operations. Keep in mind, though, that you should keep atomic operations to a minimum, as they effectively serialize execution.
For the available atomic functions, see section B.11 of the CUDA C Programming Guide, available here.
What you asked would be:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)   // bounds check first to avoid an out-of-bounds read
        //*b = a[idx]+1;
        atomicAdd(b, a[idx] + 1);
}
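Note that atomicAdd changes the semantics of the original assignment (it accumulates into *b rather than replacing it). If the intent is "keep the largest value seen, plus one", atomicMax is probably closer to what is wanted; a sketch, assuming that reading:

```cuda
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        // Atomically sets *b to max(*b, a[idx] + 1); no separate read of *b
        // is needed, so there is no race between the comparison and the write.
        atomicMax(b, a[idx] + 1);
}
```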
Upvotes: 1
Reputation: 5027
If the value is in shared memory it is local to a single thread block (one copy per block resident on a multiprocessor), NOT shared among every thread that runs for that kernel. You will definitely need atomic operations (such as atomicAdd etc.) if you expect multiple threads to write to the variable simultaneously. Be aware, though, that this serializes all simultaneous thread requests to write to the variable.
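To sketch the distinction, a common pattern is to combine a per-block __shared__ value with a single global atomic per block (this layout is an illustrative assumption, not taken from the question):

```cuda
__global__ void blockMax(unsigned int *a, unsigned int *result, unsigned int N) {
    __shared__ unsigned int block_max;   // one copy PER BLOCK, not kernel-wide
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) block_max = 0;
    __syncthreads();                     // block_max initialized before use

    if (idx < N)
        atomicMax(&block_max, a[idx]);   // shared-memory atomic, within this block only
    __syncthreads();                     // all threads in the block are done writing

    if (threadIdx.x == 0)
        atomicMax(result, block_max);    // one global atomic per block combines results
}
```

This keeps most of the atomic traffic inside fast shared memory and issues only one global atomic per block. Note that shared-memory atomics require compute capability 1.2 or newer.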
Upvotes: 1