Reputation: 113
I am new to CUDA programming and have been trying to figure out how to convert the following loop into a CUDA kernel.
for (int i = 0; i <= N; i += M)
{
    output[i].x = signal[i].x;
    output[i].y = signal[i].y;
}
Following a vector_add example, I was able to get this:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
And this is where I am stuck. As I understand it, all threads run in parallel, so I am not sure where to tell the index to skip ahead by M elements in CUDA. One alternative I thought of was to check i % M == 0 in each thread. But I'd like to know whether there is anything else I should be aware of before tackling this problem, such as thread synchronization.
Any help is appreciated.
Upvotes: 1
Views: 165
Reputation: 152173
Something like this should work:
__global__ void dec(const complex * signal, int N, int M, complex * output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i *= M; // add this line: thread t now handles element t*M
    if (i <= N)
    {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}
You should also make sure that the multiplication by M does not overflow the int variable i. This should be possible to manage by not launching unnecessary threads, i.e. don't launch a grid of significantly more than N/M threads.
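As a rough sketch of that sizing advice, the host-side arithmetic might look like the following. The helper names (threads_needed, blocks_needed, index_fits_in_int), the block size of 256, and the layout of the complex struct are assumptions for illustration and do not come from the posts above. The helpers are plain C++, so they also compile unchanged in a .cu file; the kernel variant is guarded by __CUDACC__ and widens the index to 64 bits before multiplying, which is one possible way to sidestep the overflow concern rather than the only one.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: thread t handles element t * M, so covering every
// index 0, M, 2M, ... <= N takes floor(N / M) + 1 threads.
inline long long threads_needed(int N, int M) {
    if (N < 0 || M <= 0) return 0;
    return static_cast<long long>(N) / M + 1;
}

// Hypothetical helper: number of blocks of block_size threads needed to
// cover threads_needed(N, M), rounded up.
inline long long blocks_needed(int N, int M, int block_size) {
    assert(block_size > 0);
    long long t = threads_needed(N, M);
    return (t + block_size - 1) / block_size;
}

// Hypothetical check: true if the largest product t * M that any launched
// thread can compute (including the padding threads in the last block)
// still fits in an int.
inline bool index_fits_in_int(int N, int M, int block_size) {
    long long total = blocks_needed(N, M, block_size) * block_size;
    if (total == 0) return true;
    return (total - 1) * M <= INT_MAX;
}

#ifdef __CUDACC__
// Assumed layout for the element type; the original posts do not define it.
struct complex { float x, y; };

// Same kernel as above, but the index is computed in 64 bits so the
// multiplication by M cannot wrap around for any int N and M.
__global__ void dec(const complex* signal, int N, int M, complex* output) {
    long long i = (static_cast<long long>(blockIdx.x) * blockDim.x
                   + threadIdx.x) * M;
    if (i <= N) {
        output[i].x = signal[i].x;
        output[i].y = signal[i].y;
    }
}

// Possible launch, assuming d_signal and d_output are device pointers:
//   int block = 256;
//   unsigned grid = static_cast<unsigned>(blocks_needed(N, M, block));
//   dec<<<grid, block>>>(d_signal, N, M, d_output);
#endif
```

With this sizing, the grid contains at most one partially used block, so almost every launched thread copies exactly one element and no thread has to test i % M == 0.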
Upvotes: 2