Sandeep Nagaraj

Reputation: 2452

A simple code example about CUDA warps

I have read in the CUDA documentation that, inside each block, threads are executed in batches of 32 called warps; every thread in a warp executes the same instruction but may access different data. I wanted to test the authenticity of this statement.

What I did is launch a kernel with 256 threads in a single block, so 8 batches of warps must be executed.

I create a shared array of size 32 and assign to it:

sharedVariable [ threadIdx.x % 32 ] = threadIdx.x /32;

and then assign that array to a global array of 256 ints:

outputPointer[ threadIdx.x  ] = sharedVariable [ threadIdx.x % 32 ];

Ideally, according to that assumption, I should get output like

0,0,0,... (32 times), 1,1,1,... (32 times), 2,2,2,... (32 times), and so on up to 7

but I am getting just 4,4,4,4,... for every element.

CUDA Code:

__global__ void addKernel(int *inputPointer, int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];

    sharedVariable [ threadIdx.x % 32 ] = 0;
    sharedVariable [ threadIdx.x % 32 ] = threadIdx.x / 32;

    outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
}

int main () { 
......
 addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);
......
/**Print output here */
//I am getting 4, 4, 4, 4, ... as output
}

Complete Code:

#include "cuda_runtime.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <conio.h>
#define SIZE (256 * sizeof(int))


__global__ void addKernel(int *inputPointer, int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];

    sharedVariable [ threadIdx.x % 32 ] = 0;
    sharedVariable [ threadIdx.x % 32 ] = threadIdx.x / 32;

    outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
}



int main()
{
    // Allocate host and device buffers.
    int *inputPointer = (int * ) malloc (SIZE);
    int *outputPointer= (int * ) malloc (SIZE);


    int *device_inputPointer;
    int *device_outputPointer;

    cudaMalloc((void**)&device_inputPointer, SIZE);
    cudaMalloc((void**)&device_outputPointer, SIZE);

    memset (inputPointer  , 0  ,  SIZE);
    cudaMemcpy(device_inputPointer , inputPointer,  SIZE , cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);

    cudaMemcpy(outputPointer, device_outputPointer, SIZE , cudaMemcpyDeviceToHost);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.

    for ( int i = 0 ; i < 256  ; i ++ ) {
        printf ( " %d  " , outputPointer[i] );
    }

    cudaDeviceReset();

    getch();
    return 0;
}

#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <conio.h>
#define SIZE  256 * sizeof(int)


__global__ void addKernel(int *inputPointer, int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];


    sharedVariable [ threadIdx.x % 32 ] = 0;

    sharedVariable [ threadIdx.x % 32 ] = threadIdx.x /32;


    outputPointer[ threadIdx.x  ] = sharedVariable [ threadIdx.x % 32 ];
}



int main()
{





    // Copy input vectors from host memory to GPU buffers.
    int *inputPointer = (int * ) malloc (SIZE);
    int *outputPointer= (int * ) malloc (SIZE);


    int *device_inputPointer;
    int *device_outputPointer;

    cudaMalloc((void**)&device_inputPointer, SIZE);
    cudaMalloc((void**)&device_outputPointer, SIZE);

    memset (inputPointer  , 0  ,  SIZE);
    cudaMemcpy(device_inputPointer , inputPointer,  SIZE , cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);

    cudaMemcpy(outputPointer, device_outputPointer, SIZE , cudaMemcpyDeviceToHost);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.

    for ( int i = 0 ; i < 256  ; i ++ ) {
        printf ( " %d  " , outputPointer[i] );
    }

    cudaDeviceReset();



    getch();
    return 0;
}

I tested it on different hardware; on a Tesla K20 (Kepler architecture) it works fine.

Upvotes: 0

Views: 275

Answers (2)

sgarizvi

Reputation: 16796

The code has undefined behavior in the following lines:

 sharedVariable [ threadIdx.x % 32 ] = 0;
 sharedVariable [ threadIdx.x % 32 ] = threadIdx.x /32;

Multiple threads can have the same value of threadIdx.x % 32, and these threads will try to write to the same shared memory location simultaneously. This causes a race condition between them.

For example, consider the threads with threadIdx.x values 0, 32, 64, 96, and so on. All of these threads try to access index 0 of sharedVariable, causing undefined behavior. The same holds for every group of threads whose indices differ by a multiple of 32 (in this example).
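One way to make the writes well-defined while keeping the 256-thread, single-block launch is sketched below (a sketch, not part of the original answer; the unused inputPointer parameter is dropped): give each warp its own shared slot, let only lane 0 of each warp write it, and synchronize before reading.

__global__ void addKernel(int *outputPointer)
{
    __shared__ int warpValue[ 8 ];   // one slot per warp: 256 threads / 32 = 8 warps

    // Only lane 0 of each warp writes, so every slot has exactly one writer.
    if (threadIdx.x % 32 == 0)
        warpValue[ threadIdx.x / 32 ] = threadIdx.x / 32;

    __syncthreads();   // make the writes visible to the whole block

    outputPointer[ threadIdx.x ] = warpValue[ threadIdx.x / 32 ];
}

With this version every thread in warp w reads the value w, producing the 0,0,... 1,1,... up to 7,7,... pattern the question expected.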

Upvotes: 2

hubs

Reputation: 1809

I think it's not fully clear to you how CUDA code is executed in parallel. The line sharedVariable [ threadIdx.x % 32 ] = 0; is completely useless, because the next line overwrites it with threadIdx.x / 32.

Also, your phrasing of the warp count is imprecise: your thread block consists of 256 threads, so there are 8 warps (256 / 32).

You cannot expect any particular result from your code, because its behavior is undefined! The result of threadIdx.x / 32 will be in the range 0 to 7, depending on threadIdx.x, which is in the range 0..255. Because there are 8 warps of 32 threads each, there will be 8 writes to each element of sharedVariable [ threadIdx.x % 32 ], and you have no control over which warp is executed first and which last.

In your case warp 4 was executed last, and therefore your results are all 4's.
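Whether warp 4 or some other warp writes last depends on the scheduler. A small sketch can make the arrival order visible (warpOrderKernel, order, and counter are illustrative names, not from the original code):

__global__ void warpOrderKernel(int *order, int *counter)
{
    // Lane 0 of each warp takes the next free slot and records its warp id.
    if (threadIdx.x % 32 == 0) {
        int slot = atomicAdd(counter, 1);
        order[slot] = threadIdx.x / 32;
    }
}

Launched as warpOrderKernel<<<1, 256>>>(device_order, device_counter) with a zero-initialized counter, printing order[0..7] typically yields a sequence that can change from run to run and from GPU to GPU.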

To achieve the results that you expect from your kernel, it can be changed to:

__global__ void addKernel(int *outputPointer)
{
    outputPointer[ threadIdx.x  ] = threadIdx.x /32;
}

I see no way to use shared memory here the way you want to.
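For contrast, a case where shared memory genuinely helps is when the threads of a block exchange data, for example a block-wide sum. The sketch below assumes 256 threads per block, as in the question:

__global__ void blockSum(const int *input, int *blockSums)
{
    __shared__ int partial[ 256 ];
    int tid = threadIdx.x;

    partial[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}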

Upvotes: 2
