CUDA - Synchronizing Threads - Wait until the first one has finished writing

Question

I am trying to do the following (simplified). Please read the edit section below!

__shared__ int currentPos = 0;
__global__ void myThreadedFunction(float *in, float *out)
{
    // do calculations with in values
    ...

    // now the first thread reaches this point:
    // suspend the other threads here

    out += currentPos;
    for (int i = 0; i < size; ++i)
    {
        *(out++) = calculation[i];
    }
    currentPos +=  size;

    // now this thread has finished, the other threads can
    // go on with writing
}

So how do I suspend threads before they write to the same memory? I cannot write concurrently, because I do not know the size of each calculated array (calculation[i] - size) in advance.

I know there are __syncthreads() and __threadfence(), but I don't know how to use them correctly for this problem.

Edit: What I want to do is:

I have 2 threads (just as an example). Each thread calculates a new array from float *in.

Thread 1 calculated: { 1, 3, 2, 4 }

Thread 2 calculated: { 3, 2, 5, 6, 3, 4 }

The size of each array is known only after the calculation. Now I want to write these arrays to float *out.

It does not matter to me whether thread 1 or thread 2 writes first. The output could be { 1, 3, 2, 4, 3, 2, 5, 6, 3, 4 } or { 3, 2, 5, 6, 3, 4, 1, 3, 2, 4 }.

So how do I calculate the positions in the output array?

I don't want to use a fixed "array size", which would make the output look like { 1, 3, 2, 4, ?, ?, 3, 2, 5, 6, 3, 4 }.

I think I could use a shared variable POSITION that holds the next writing position.

Thread 1 reaches the writing point (after calculating its new array). Thread 1 writes its array size (4) to the shared variable POSITION.

While thread 1 is writing its temporary array to the output array, thread 2 reads the variable POSITION, adds its own temporary array size (6) to it, and starts writing at the position where thread 1 ends.

If there were a thread 3, it would also read POSITION, add its array size, and write to the output where thread 2 ends.

Does anyone have an idea?

Accepted Answer

Conceptually, this is how you would do concurrent output, using a shared array to store the indices for each thread.

#define BLOCK_SIZE 256 // assumed compile-time block size; must match the launch configuration

__global__ void myThreadedFunction(float *in, float *out)
{
    // a static shared array needs a constant size, so blockDim.x cannot be used here
    __shared__ int index[BLOCK_SIZE];
    int tid = threadIdx.x;

    // do calculations with in values
    ...

    index[tid] = size; // assuming size is the size of the array this thread outputs
    // serial inclusive prefix sum; a parallel scan would give better performance
    for (int i = 1; i < blockDim.x; ++i) {
        __syncthreads(); // make the previous thread's running total visible
        if (tid == i) {
            index[tid] += index[tid - 1];
        }
    }
    int startposition = index[tid] - size; // start at the beginning of this thread's range, not where it ends

    // do your output for all threads concurrently, where startposition is the first index you output to
}

So what you do is set index[tid] to the size you want to output, where tid is the thread index threadIdx.x, then sum upwards through the array (increasing index). Afterwards index[tid] holds the end offset of thread tid's output, counted from the start of thread 0's output, so index[tid] - size is the starting index in your output array. The summation could be done more efficiently with a parallel prefix sum (scan).
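To make the steps above concrete, here is a minimal single-block sketch of the same idea. The names packKernel, packOnDevice, BLOCK_SIZE, MAX_ITEMS and the total output parameter are hypothetical, and the per-thread "calculation" is a stand-in in which thread t produces t + 1 copies of the value t. It assumes one block only and omits CUDA error checking, so treat it as an illustration, not a finished implementation.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 4; // hypothetical block size, kept tiny for illustration
constexpr int MAX_ITEMS  = 8; // hypothetical upper bound on items per thread

// All threads of one block write their variable-length results back to back.
__global__ void packKernel(float *out, int *total)
{
    __shared__ int index[BLOCK_SIZE];
    int tid = threadIdx.x;

    // Stand-in calculation: thread t produces t + 1 copies of the value t.
    float calculation[MAX_ITEMS];
    int size = tid + 1;
    for (int i = 0; i < size; ++i)
        calculation[i] = (float)tid;

    // Publish each size, then build a serial inclusive prefix sum.
    index[tid] = size;
    for (int i = 1; i < blockDim.x; ++i) {
        __syncthreads();
        if (tid == i)
            index[tid] += index[tid - 1];
    }

    // index[tid] is the end offset, so subtract size to get the start.
    int startposition = index[tid] - size;
    for (int i = 0; i < size; ++i)
        out[startposition + i] = calculation[i];

    // The last thread's end offset is the total number of values written.
    if (tid == blockDim.x - 1)
        *total = index[tid];
}

// Host helper: launches one block and copies the packed output back.
std::vector<float> packOnDevice()
{
    const int capacity = BLOCK_SIZE * MAX_ITEMS;
    float *dOut = nullptr;
    int *dTotal = nullptr;
    cudaMalloc(&dOut, capacity * sizeof(float));
    cudaMalloc(&dTotal, sizeof(int));

    packKernel<<<1, BLOCK_SIZE>>>(dOut, dTotal);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<float> result(total);
    cudaMemcpy(result.data(), dOut, total * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dOut);
    cudaFree(dTotal);
    return result;
}
```

With the stand-in calculation, the thread sizes are 1, 2, 3 and 4, so calling packOnDevice() from main should return ten values with no gaps: 0, 1, 1, 2, 2, 2, 3, 3, 3, 3. Extending this to several blocks would need an additional per-block offset, which this sketch does not cover.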
