cuda

George

Reputation: 5691

CUDA thread synchronization

I am a bit confused about synchronization.

Using __syncthreads() you can synchronize the threads in a block. Must __syncthreads() be used only together with shared memory, or does combining it with shared memory simply give the best performance?
Generally, threads can safely communicate with each other only if they are in the same thread block, right? So why don't we always use shared memory? Because it's not big enough? And if we don't use shared memory, how can we ensure that the results are right?
I have a program that sometimes runs fine (I get the results) and sometimes gives 'nan' results without my altering anything. Can that be a synchronization problem?

Upvotes: 0

Answers (3)

jopasserat

Reputation: 5930

Using __syncthreads() does not involve shared memory; it only synchronizes the threads within a block. But you do need to synchronize threads when you want them to share data through shared memory.
We don't always use shared memory because it is quite small, and because it can slow down your application when used badly: poorly chosen access patterns cause bank conflicts. Moreover, recent architectures (compute capability 2.0 and later) implement shared memory in the same hardware as the L1 cache. Thus, some seasoned CUDA developers recommend not using shared memory and relying on the cache alone.
It can be a synchronization problem. If you want to know whether it is a deadlock, try increasing the number of blocks you're using. If it is a deadlock, your GPU should freeze. If it is not, post your code; it will be easier for us to answer ;)
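
To illustrate the first point above (threads sharing data through shared memory need a barrier), here is a minimal sketch, not taken from the question's code. The names `kBlockSize`, `reverseTile` and `reverseTilesOnGpu` are hypothetical, and the sketch assumes the input length is a non-zero multiple of the block size.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // hypothetical block size for this sketch

// Each block reverses its own tile of kBlockSize elements in place.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[kBlockSize];

    int local = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];  // each thread writes exactly one slot

    // Barrier: all writes to tile[] above are complete and visible to the
    // whole block before any thread reads a slot written by another thread.
    __syncthreads();

    data[global] = tile[blockDim.x - 1 - local];
}

// Host helper (hypothetical): reverses every kBlockSize-sized tile of v.
// Assumes v.size() is a non-zero multiple of kBlockSize.
// Returns false if a CUDA call reports an error.
bool reverseTilesOnGpu(std::vector<int> &v)
{
    int *d = nullptr;
    size_t bytes = v.size() * sizeof(int);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    unsigned blocks = static_cast<unsigned>(v.size() / kBlockSize);
    reverseTile<<<blocks, kBlockSize>>>(d);

    bool ok = cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}
```

Removing the `__syncthreads()` call turns the final read into a race: a thread may read a slot of `tile` that its owner has not written yet.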

Upvotes: 2

chaohuang

Reputation: 4115

Although shared memory and __syncthreads() are independent concepts, they often go hand in hand. If threads operate independently, there is no need to use __syncthreads().
Two aspects restrict the use of shared memory: (1) its size is limited, and (2) to achieve the best performance, you need to avoid bank conflicts when using it.
It could be due to a missing __syncthreads(). Using shared memory without __syncthreads() can lead to unpredictable results, because a thread may read a location before another thread has written it.
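
As a sketch of where the barriers typically go, here is a per-block sum in shared memory. This is an illustration under assumptions, not the asker's program: `kThreads`, `blockSum` and `sumOnGpu` are hypothetical names, and the halving loop assumes the block size is a power of two.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // hypothetical; must be a power of two here

// Each block sums its slice of `in` and writes one partial sum to `out`.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float partial[kThreads];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads contribute 0 so no slot is left uninitialized.
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all slots are written before anyone reads them

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        // Kept outside the `if` so every thread in the block reaches it.
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = partial[0];
}

// Host helper (hypothetical): sums v by adding the per-block partial sums.
// Assumes v is non-empty; error checking is omitted to keep the sketch short.
float sumOnGpu(const std::vector<float> &v)
{
    int n = static_cast<int>(v.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, v.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dOut, n);

    std::vector<float> partials(blocks);
    cudaMemcpy(partials.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partials) total += p;
    return total;
}
```

Dropping either barrier lets a thread add a value that has not been written yet. Reading uninitialized or half-updated shared memory in this way is one plausible source of intermittent wrong values, possibly including 'nan'.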

Upvotes: 0

gamerx

Reputation: 579

__syncthreads() and shared memory are independent ideas; you don't need one to use the other. The only requirement for using __syncthreads() that comes to mind is that all the threads in the block must eventually arrive at that point in the code; otherwise your program will simply hang.
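
A minimal sketch of that requirement, assuming an array length that is not a multiple of the block size. The names `kBlock`, `adjacentDiff` and `adjacentDiffOnGpu` are hypothetical. The tempting `if (i >= n) return;` at the top of the kernel would let some threads skip the barrier, so the sketch guards the memory accesses instead.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;  // hypothetical block size for this sketch

// out[i] = in[i] - in[i-1], with out[0] = in[0].
__global__ void adjacentDiff(const int *in, int *out, int n)
{
    __shared__ int tile[kBlock];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    bool inRange = i < n;

    // No early return here: out-of-range threads must still reach the barrier.
    if (inRange) tile[t] = in[i];
    __syncthreads();  // reached by every thread in the block

    if (inRange) {
        // The first thread of a block has no left neighbour in the tile,
        // so it falls back to global memory (or 0 at the very start).
        int left = (t > 0) ? tile[t - 1] : (i > 0 ? in[i - 1] : 0);
        out[i] = tile[t] - left;
    }
}

// Host helper (hypothetical); error checking omitted to keep the sketch short.
std::vector<int> adjacentDiffOnGpu(const std::vector<int> &in)
{
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    if (n == 0) return out;

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    adjacentDiff<<<(n + kBlock - 1) / kBlock, kBlock>>>(dIn, dOut, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The design choice is to keep `__syncthreads()` at a point that every thread of the block executes, and to put only the loads and stores behind the range check.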

As for shared memory, yes, size is probably why you don't see it used all the time. From my understanding, the shared memory of an SM is split among the blocks resident on it. For example, if a kernel uses 1 KB of shared memory per block, keeping 100 blocks resident on one SM at once would require 100 KB, which can exceed what is available on the SM, so fewer blocks run concurrently.
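
Rather than guessing the limits, you can ask the runtime. The sketch below is an illustration with hypothetical names (`usesSharedMemory`, `residentBlocksPerSm`, `printSharedMemoryLimits`); it assumes device 0 and a toy kernel with dynamically sized shared memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel whose shared memory size is set at launch time.
__global__ void usesSharedMemory(float *out)
{
    extern __shared__ float scratch[];
    scratch[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[blockDim.x - 1 - threadIdx.x];
}

// How many blocks of this kernel can be resident on one SM at the same time,
// given the block size and the dynamic shared memory requested per block.
// Returns -1 if the runtime reports an error.
int residentBlocksPerSm(int blockSize, size_t sharedBytesPerBlock)
{
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, usesSharedMemory, blockSize, sharedBytesPerBlock);
    return err == cudaSuccess ? blocks : -1;
}

// Prints the shared memory limits of device 0 (assumed to exist).
void printSharedMemoryLimits()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return;
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
}
```

Comparing, say, `residentBlocksPerSm(128, 1024)` with `residentBlocksPerSm(128, 16 * 1024)` shows how a larger per-block request reduces the number of blocks an SM can hold at once; the actual numbers depend on the GPU.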

Upvotes: 0
