nt2005

Reputation: 69

CUDA - Synchronizing threads: wait until the first thread has finished writing

I am trying to do the following (simplified). Please read the edit section!

__shared__ int currentPos = 0;
__global__ void myThreadedFunction(float *in, float *out)
{
    // do calculations with in values
    ...

    // the first thread to reach this point continues;
    // all other threads should be suspended here

    out += currentPos;
    for (int i = 0; i < size; ++i)
    {
        *(out++) = calculation[i];
    }
    currentPos += size;

    // now this thread is finished, and the other threads can
    // go on with writing
}

So how do I suspend threads before they write to the same memory? I cannot write concurrently, because I do not know the size of each calculated array (calculation[i] - size) in advance.

I know there are __syncthreads() and __threadfence(), but I don't know how to use them correctly for this problem.

Edit: What I want to do is:

I have 2 threads (just as an example). Each thread calculates a new array from float *in.

Thread 1 calculated: { 1, 3, 2, 4 }

Thread 2 calculated: { 3, 2, 5, 6, 3, 4 }

The size of these arrays is only known after the calculation. Now I want to write these arrays into float *out.

It does not matter to me whether thread 1 or thread 2 writes first. The output could be { 1, 3, 2, 4, 3, 2, 5, 6, 3, 4 } or { 3, 2, 5, 6, 3, 4, 1, 3, 2, 4 }.

So how do I calculate the positions in the output array?

I don't want to use a fixed "array size", which would leave gaps in the output: { 1, 3, 2, 4, ?, ?, 3, 2, 5, 6, 3, 4 }

I think I could use a shared variable POSITION that holds the next writing position.

Thread 1 reaches the writing point (after calculating its new array) and adds its array size (4) to the shared variable POSITION.

While thread 1 is writing its temporary array to the output array, thread 2 reads POSITION, adds its own array size (6) to it, and starts writing at the position where thread 1's data ends.

If there were a thread 3, it would also read POSITION, add its array size, and write into the output where thread 2's data ends.

Does anyone have an idea?
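The POSITION scheme described in the edit reads like an atomic fetch-and-add. As a minimal sketch, assuming a global (not per-block) counter and a stand-in calculation, it might look like the following. The names g_position, appendRuns and runAppend are hypothetical and not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Global write cursor playing the role of POSITION (hypothetical name).
__device__ int g_position = 0;

// Each thread appends sizes[tid] values to out as one contiguous run.
// The value tid + 1 is only a stand-in for the real calculation.
__global__ void appendRuns(const int *sizes, int n, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int size = sizes[tid];
    // atomicAdd returns the old value, i.e. the start of this thread's range,
    // and advances the cursor past it in one indivisible step.
    int start = atomicAdd(&g_position, size);
    for (int i = 0; i < size; ++i)
        out[start + i] = static_cast<float>(tid + 1);
}

// Host helper (hypothetical): runs the kernel and returns the packed output.
std::vector<float> runAppend(const std::vector<int> &sizes)
{
    int n = static_cast<int>(sizes.size());
    int total = 0;
    for (int s : sizes) total += s;

    int *dSizes = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dSizes, n * sizeof(int));
    cudaMalloc(&dOut, total * sizeof(float));
    cudaMemcpy(dSizes, sizes.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Reset the cursor before every launch.
    int zero = 0;
    cudaMemcpyToSymbol(g_position, &zero, sizeof(int));

    appendRuns<<<(n + 127) / 128, 128>>>(dSizes, n, dOut);

    std::vector<float> out(total);
    cudaMemcpy(out.data(), dOut, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSizes);
    cudaFree(dOut);
    return out;
}
```

Calling runAppend({4, 6}) should return ten values with no gaps. Which run comes first is not deterministic, which matches the requirement that the order of the threads does not matter.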

Upvotes: 1

Views: 3076

Answers (2)

kangshiyin

Reputation: 9781

This code works as you expect. It reads input[] concurrently. For each input element size, it writes size values to result, in the order in which the elements are stored in input[].

Please note that this serialized writing may take much longer than doing the same thing on the CPU. Since you already know the size of the data each thread has to write, you may want to use a parallel prefix sum to calculate the writing position for each thread first, and then write the data concurrently.

See "Memory Fence Functions" in the CUDA C Programming Guide for more information about the __threadfence() used in the code.

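#include <iostream>
#include <iterator>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>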

volatile __device__ int count = 0;
volatile __device__ int pos = 0;
__global__ void serial(const float* input, const int N, float* result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    //parallel part
    int size = (int) input[id];

    //serial output
    for (int i = 0; i < N; i++)
    {
        int localcount = count;
        if (localcount == id)
        {
            int localpos = pos;
            for (int j = 0; j < size; j++)
            {
                result[localpos + j] = (float) j + 1;
            }
            pos = localpos + size;
            count = localcount + 1;
            __threadfence();
        }
        while (count == localcount)
        {
            __syncthreads();
        };

    }
}

int main()
{
    int N = 6;
    thrust::device_vector<float> input(
            thrust::counting_iterator<float>(1),
            thrust::counting_iterator<float>(1) + N);

    thrust::device_vector<float> result(N * (N + 1) / 2);
    serial<<<2, 3>>>(
            thrust::raw_pointer_cast(&input[0]),
            N,
            thrust::raw_pointer_cast(&result[0]));

    thrust::copy(
            result.begin(), result.end(),
            std::ostream_iterator<float>(std::cout, " "));

    return 0;

}

Output, as expected:

1 1 2 1 2 3 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6 
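The prefix-sum alternative mentioned above could be sketched roughly as follows, assuming the per-thread sizes are already available in a device array. The names writeRuns and scanThenWrite are hypothetical; thrust::exclusive_scan turns the sizes into start offsets, after which every thread writes independently.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>

// Each thread writes its run 1..size at the offset precomputed by the scan.
__global__ void writeRuns(const int *sizes, const int *offsets, int n, float *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id >= n) return;

    int size = sizes[id];
    int start = offsets[id];
    for (int j = 0; j < size; ++j)
        result[start + j] = (float) j + 1;
}

// Host helper (hypothetical): scan the sizes, then write all runs concurrently.
std::vector<float> scanThenWrite(const std::vector<int> &hostSizes)
{
    int n = static_cast<int>(hostSizes.size());
    if (n == 0) return {};

    thrust::device_vector<int> sizes(hostSizes.begin(), hostSizes.end());
    thrust::device_vector<int> offsets(n);

    // offsets[i] = sizes[0] + ... + sizes[i - 1]
    thrust::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin());

    // Total output length = last offset + last size.
    int total = (int) offsets.back() + (int) sizes.back();
    thrust::device_vector<float> result(total);

    writeRuns<<<(n + 127) / 128, 128>>>(
            thrust::raw_pointer_cast(sizes.data()),
            thrust::raw_pointer_cast(offsets.data()),
            n,
            thrust::raw_pointer_cast(result.data()));

    std::vector<float> out(total);
    thrust::copy(result.begin(), result.end(), out.begin());
    return out;
}
```

With sizes 1 through 6, this sketch should produce the same sequence as the serial kernel, but without any thread waiting for another.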

Upvotes: 0

1-----1

Reputation: 1433

Conceptually, this is how you would do concurrent output, using a shared array to store the index for each thread.

#define BLOCK_SIZE 256 // assumed value; must be a compile-time constant equal to blockDim.x

__global__ void myThreadedFunction(float *in, float *out)
{
    __shared__ int index[BLOCK_SIZE]; // static shared arrays need a constant size
    int tid = threadIdx.x;
    // do calculations with in values
    ...

    index[tid] = size; // assuming size is the size of the array you output
    // a parallel scan instead of this serial loop would give better performance
    for (int i = 1; i < blockDim.x; ++i) {
        __syncthreads();
        if (tid == i) {
            index[tid] += index[tid - 1];
        }
    }
    // index[tid] is now the end of this thread's range; you want to start at its beginning
    int startposition = index[tid] - size;

    // do your output for all threads concurrently, where startposition is the first index you output to

}

So what you do is set index[tid] to the size you want to output, where tid is the thread index threadIdx.x, and then sum up the array in the direction of increasing index. Afterwards, index[tid] holds the offset at which thread tid's output ends, counted from the start of thread 0's output, so subtracting size gives the starting index. The summation could easily be done with a parallel prefix sum (scan) instead of the serial loop.
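As a rough illustration of the parallel summation, here is a minimal sketch of a Hillis-Steele inclusive scan in shared memory. It assumes a single block of BLOCK_SIZE threads and at most BLOCK_SIZE sizes; the names blockOffsets and blockStartPositions are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_SIZE = 256; // assumed block size, must match the launch

// Computes the start offset of every thread's output range within one block.
__global__ void blockOffsets(const int *sizes, int n, int *offsets)
{
    __shared__ int index[BLOCK_SIZE];
    int tid = threadIdx.x;

    int size = (tid < n) ? sizes[tid] : 0; // padding threads contribute nothing
    index[tid] = size;
    __syncthreads();

    // Inclusive scan in log2(BLOCK_SIZE) steps instead of BLOCK_SIZE - 1.
    for (int stride = 1; stride < BLOCK_SIZE; stride *= 2) {
        int add = (tid >= stride) ? index[tid - stride] : 0;
        __syncthreads(); // every thread has read before anyone writes
        index[tid] += add;
        __syncthreads(); // every thread has written before the next step reads
    }

    // The inclusive sum minus the thread's own size is its start position.
    if (tid < n) offsets[tid] = index[tid] - size;
}

// Host helper (hypothetical); assumes 0 < sizes.size() <= BLOCK_SIZE.
std::vector<int> blockStartPositions(const std::vector<int> &sizes)
{
    int n = static_cast<int>(sizes.size());
    int *dSizes = nullptr;
    int *dOffsets = nullptr;
    cudaMalloc(&dSizes, n * sizeof(int));
    cudaMalloc(&dOffsets, n * sizeof(int));
    cudaMemcpy(dSizes, sizes.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockOffsets<<<1, BLOCK_SIZE>>>(dSizes, n, dOffsets);

    std::vector<int> offsets(n);
    cudaMemcpy(offsets.data(), dOffsets, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dSizes);
    cudaFree(dOffsets);
    return offsets;
}
```

For the sizes 4 and 6 from the question, the start positions come out as 0 and 4. Extending this across several blocks needs an extra pass over the per-block totals, which is not shown here.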

Upvotes: 2
