Marc41

Reputation: 3

CUDA C++ shared memory and if-condition

I have a question I couldn't find an answer to myself, and I was hoping some of you could offer me some insight into a possible solution. Within a kernel, I would like to use an if-condition to guard access to shared memory.

__global__ void GridFillGPU (int * gridGLOB, int n) {
    __shared__ int grid[SIZE]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        // claim the first free slot: check, then write
        for (int k = 0; k < SIZE; k++) {
            if (grid[k] == 0) {
                grid[k] = tid + 1;
                break;
            }
        }
    }
    // ... here write grid to global memory gridGLOB
}

The idea is that, if the element grid[k] has already been written by one thread (with index tid), it should not be written by another one. My question is: can this even be done in parallel? Since all parallel threads execute the same for-loop, how can I be sure that the if-condition is evaluated correctly? I am guessing this will lead to race conditions. I am quite new to CUDA, so I hope this question is not stupid. I know that grid needs to be in shared memory, and that one should avoid if-statements, but I see no other way around it at the moment. I am thankful for any help.

EDIT: here is the explicit version, which explains why the array is called grid

__global__ void GridFillGPU (int * pos, int * gridGLOB, int n) {
    __shared__ int grid[SIZE*7]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        int jmin = pos[tid] - 3;
        int jmax = pos[tid] + 3;
        for (int j = jmin; j <= jmax; j++) {
            for (int k = 0; k < SIZE; k++) {
                if (grid[(j-jmin)*SIZE + k] == 0) {
                    grid[(j-jmin)*SIZE + k] = tid + 1;
                    break;
                }
            }
        }
    }
    // ... here write grid to global memory gridGLOB
}

Upvotes: 0

Views: 1447

Answers (1)

Marco Giordano

Reputation: 610

You should model your problem in a way that you don't need to worry about whether an element "has been written already", because CUDA offers no guarantee about the order in which threads are executed, so the order might not be what you expect. There are some minor ordering guarantees CUDA gives you within a warp, but that does not help here. There are synchronization barriers and atomic operations you can use, but I don't think that is what you want in this case.
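That said, if you really do want to keep the check-then-write pattern, the test and the store have to happen as a single atomic step. Here is a minimal sketch of your first kernel using atomicCAS on shared memory (the cooperative zero-initialization and the barrier are assumptions filled in for completeness):

    __global__ void GridFillGPU (int * gridGLOB, int n) {
        __shared__ int grid[SIZE];
        int tid = threadIdx.x;
        // zero the shared array cooperatively, then synchronize the block
        for (int k = tid; k < SIZE; k += blockDim.x) grid[k] = 0;
        __syncthreads();
        if (tid < n) {
            for (int k = 0; k < SIZE; k++) {
                // atomicCAS stores tid+1 only if grid[k] is still 0 and
                // returns the old value, so exactly one thread claims each slot
                if (atomicCAS(&grid[k], 0, tid + 1) == 0) {
                    break;
                }
            }
        }
        // ... here write grid to global memory gridGLOB
    }

Even with the atomic, which thread ends up in which slot still depends on scheduling, which is why remodeling the problem is the better fix.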

If you are processing a grid, you should model it so that each thread has its own region of memory to work on, and that region should not overlap with other threads' regions (at least for writing; for reading you can go outside your boundaries). Also, I would not worry about shared memory yet: make the algorithm work first, then think about optimizations like loading a tile into shared memory using the warp.
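For example, a minimal sketch of that idea (the kernel name and parameters are made up for illustration): each thread owns a disjoint slice of the output, so no two threads can ever write the same element:

    __global__ void FillOwnRegion (int * out, int itemsPerThread, int n) {
        int tid = threadIdx.x;
        if (tid < n) {
            // this thread owns [base, base + itemsPerThread); no one else writes there
            int base = tid * itemsPerThread;
            for (int k = 0; k < itemsPerThread; k++) {
                out[base + k] = tid + 1;
            }
        }
    }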

In that case, if you want to split your domain into a grid, you should set up the kernel launch so that you have as many threads as your grid has "cells" (or pixels, if it is an image). Then you use the thread and block coordinates that CUDA provides to compute where each thread should read and write in memory.
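A minimal sketch of that mapping (the names are placeholders, assuming one thread per cell of a width-by-height domain):

    __global__ void ProcessCells (int * cells, int width, int height) {
        // global 2D coordinates of this thread's cell
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            int idx = y * width + x; // row-major linear index
            cells[idx] += 1;         // each thread touches only its own cell
        }
    }

You would launch it with, for example, 16x16 blocks and a grid rounded up so that blockDim.x * gridDim.x >= width and blockDim.y * gridDim.y >= height; the bounds check discards the spare threads.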

There is a really good course about CUDA on udacity.com that you might want to have a look at: https://www.udacity.com/courses/cs344 There is also another one on coursera.com, but I don't know if it is open right now. Anyway, dividing the domain into a grid is a really common and solved problem; you can find a lot of material on it.

Upvotes: 0
