Jimbo

Reputation: 3284

How to sync threads in this CUDA example

I have the following rough code outline:

  1. run a loop, millions of times
  2. in that loop, compute values 'I' - see examples of such functions below
  3. after all 'I's have been computed, compute other values 'V'
  4. repeat the loop

Each computation of an I or V could involve up to 20-ish mathematical operations (e.g. I1 = A + B/C * D + 1/exp(V1) - E + F + V2, etc.).

There are roughly:

  1. 50 'I's
  2. 10 'V's
  3. 10 values in each I and V, i.e. they are vectors of length 10

At first I tried running a simple loop in C, with kernel calls for each time step, but this was really slow; it looked roughly like the sketch just below. It seems like I can get the code to run faster if the main loop is in a kernel that calls other kernels. However, I'm worried about kernel call overhead (maybe I shouldn't be), so I came up with something like the kernels shown after the sketch, where each I and V loops independently, with syncing between the kernels as necessary.
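The per-time-step version, roughly (a sketch, not the exact code; in this version each kernel computes a single time step):

for (int i = 0; i < n_t; i++){   // n_t ~ 1e6 time steps
    compute_IL1<<<1,10>>>();     // one launch per I per step: the launch
    compute_IK1<<<1,10>>>();     // overhead dwarfs the ~20 flops of work
    //... ~50 more I kernels ...
    compute_V1<<<1,10>>>();      // V's run after all the I's for this step
    //... ~10 more V kernels ...
}
cudaDeviceSynchronize();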

For reference, the variables below are hardcoded as __device__ values, but eventually I will pass some values into specific kernels to make the system interesting.

__global__ void compute_IL1()
{
    int id = threadIdx.x; 
    //n_t = 1e6;
    for (int i = 0; i < n_t; i++){
        IL1[id] = gl_1*(V1[id] - El_1);
        //atomic, sync, event????,
    }
}        

__global__ void compute_IK1()
{
    int id = threadIdx.x; 
    for (int i = 0; i < n_t; i++){
        Ik1[id] = gk_1*powf(0.75*(1-H1[id]),4)*(V1[id]-Ek_1);
        //atomic, sync, event?
    }
}   

__global__ void compute_V1()
{
    int id = threadIdx.x; 
    for (int i = 0; i < n_t; i++){
        //wait for IL1 and Ik1 and others, but how????
        V1[id] = Ik1[id]+IL1[id] + ....
        //trigger the I's again
    }
}


//main function
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

compute_IL1<<<1,10,0,s0>>>();
compute_IK1<<<1,10,0,s1>>>();
//repeat this for the 50 - 70 more kernels (Is and Vs), each in its own stream

So the question is, how would I sync these kernels? Is an event approach best? Is there a better paradigm to use here?

Upvotes: 1

Views: 505

Answers (2)

Jimbo

Reputation: 3284

So, I tried a couple of approaches.

  1. A loop with a few kernel calls per time step, where the last kernel call depends on the previous ones. This can be done with cudaStreamWaitEvent; each call waits on one event, but several calls on the same stream let it wait for multiple events (first sketch below). I found this on: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/ . Unfortunately, the kernel calls were too expensive.

  2. Global variables shared between kernels running in concurrent streams (second sketch below). The logic was pretty simple: one thread spins until a global variable equals the loop counter, indicating that the other kernel has finished its step; this is then followed by a __syncthreads() call to release the rest of the block. Unfortunately, this did not work well.
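The event version looked roughly like this - a sketch under the assumption that each kernel computes one time step per launch (the stream/event plumbing is the point here, not the exact names):

cudaStream_t s0, s1, s2;
cudaEvent_t  e0, e1;
cudaStreamCreate(&s0); cudaStreamCreate(&s1); cudaStreamCreate(&s2);
cudaEventCreateWithFlags(&e0, cudaEventDisableTiming);
cudaEventCreateWithFlags(&e1, cudaEventDisableTiming);

for (int i = 0; i < n_t; i++){
    compute_IL1<<<1,10,0,s0>>>();
    cudaEventRecord(e0, s0);          // mark IL1 done for this step
    compute_IK1<<<1,10,0,s1>>>();
    cudaEventRecord(e1, s1);          // mark IK1 done for this step
    cudaStreamWaitEvent(s2, e0, 0);   // V1 must wait for both I kernels
    cudaStreamWaitEvent(s2, e1, 0);
    compute_V1<<<1,10,0,s2>>>();
    // a real version would also need V-to-I events so the next step's
    // I kernels don't overwrite values V1 is still reading
}
cudaDeviceSynchronize();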
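And the global-variable version, roughly (the names here are hypothetical; note this relies on both kernels actually being resident on the device at the same time, which CUDA does not guarantee - presumably why it did not work well):

__device__ volatile int step_done = -1;

__global__ void produce_I(int i)
{
    // ... compute this step's I values ...
    __threadfence();              // make the writes visible device-wide
    if (threadIdx.x == 0)
        step_done = i;            // publish that step i is finished
}

__global__ void consume_V(int i)
{
    if (threadIdx.x == 0)
        while (step_done != i);   // spin until the producer publishes step i
    __syncthreads();              // release the rest of the block
    // ... compute this step's V values ...
}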

Ultimately, I think I've settled on a nested loop, where the outer loop represents time and the inner loop indicates which of a set of instructions to run, based on dependencies. I also launched the maximum number of threads per block (1024) and broke the vectors that needed to be processed up across warps. The rough pseudocode is:

run_main<<<1,1024>>>();

__global__ void run_main(){
    int warp = threadIdx.x/32;          // which warp this thread is in
    int id   = threadIdx.x - warp*32;   // lane index within the warp

    for (int i = 0; i < n_t; i++){      // outer loop: time
        for (int j = 0; j < n_j; j++){  // inner loop: dependency stages
            if (id < 10){               // only 10 lanes per warp have work
                switch (j){
                    case 0:
                        switch (warp){
                            case 0:
                                I1[id] = a + b + c*d ...
                                break;
                            case 1:
                                I2[id] = f*g/h
                                break;
                        }
                        break;
                    //These things depend on case 0 OR
                    //we've run out of space in the first pass
                    //32 cases max [0 ... 31]
                    case 1:
                        switch (warp){
                            case 0:
                                V1[id] = I1*I2+ ...
                                break;
                            case 1:
                                V2[id] = ...
                                break;
                        }
                        break;
                }
            }
            //syncs across the block; every thread must reach this,
            //so it sits outside the id < 10 guard
            __syncthreads();
        }
    }
}

This design is based on my impression that each set of 32 threads (a warp) runs independently, but that the threads within a warp should run the same code, otherwise things can slow down significantly.

So in the end, I'm running roughly 32*10 instructions simultaneously, where 32 is the number of warps (bounded by how many different values I can compute at the same time, given the dependencies) and 10 is the number of elements in each vector. This is slowed down by any imbalance in the number of computations across the warp cases, since all warps have to merge before moving on to the next step (due to the __syncthreads call). I'm running different parameters (a parameter sweep) on top of this, so I could potentially run 3 at a time in the block, multiplied by the number of streaming multiprocessors (SMs) on the card (see the sketch below).
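A sketch of how the sweep could be layered on (the Params struct and its plumbing are hypothetical, not my exact code): one block per parameter set, selected by blockIdx.x.

struct Params { float gl_1, gk_1; /* ...the other constants... */ };

__global__ void run_main(const Params *p)
{
    const Params par = p[blockIdx.x];   // this block's parameter set
    // ... same time/dependency loops as above, reading par.gl_1 etc. ...
}

//launch: one block per parameter set, 1024 threads each
//run_main<<<n_param_sets, 1024>>>(d_params);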

One thing I need to change is that I'm currently testing on a video card that also drives a monitor. Apparently the Windows display driver watchdog (TDR) will kill any kernel that runs for more than a few seconds, so I need to call the kernel in chunked time steps, like once every 1e5 time steps (in my case).
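A minimal sketch of the chunking (the chunk size and the kernel parameters here are assumptions): each launch advances a fixed number of time steps, so no single launch outlives the watchdog.

const long long n_chunk = 100000;            // 1e5 time steps per launch
for (long long t0 = 0; t0 < n_t; t0 += n_chunk){
    run_main<<<1,1024>>>(t0, n_chunk);       // kernel loops t0 .. t0+n_chunk-1
    cudaDeviceSynchronize();                 // wait, and surface watchdog errors
}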

Upvotes: 0

talonmies

Reputation: 72349

There is no sane mechanism I can think of to have multiple resident kernels synchronize without resorting to hacky atomic tricks which may well not work reliably.

If you are running blocks with 10 threads, and these kernels cannot execute concurrently for correctness reasons, you are (in the best possible case) using 1/64 of the computational capacity of your device. This problem, as you have described it, sounds completely ill-suited to a GPU.

Upvotes: 1
