Reputation: 2452
I have read in the CUDA documentation that, inside each block, threads are executed in batches of 32 called warps; every thread in a warp executes the same instruction, but each can access different data. My goal was to test this statement.
What I did was launch a kernel with 256 threads in a single block, so 8 warps must be executed.
I create a shared array of 32 ints and assign to it:
sharedVariable [ threadIdx.x % 32 ] = threadIdx.x / 32;
and then copy that array into a global array of 256 ints:
outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
Ideally, according to this assumption, I should get as output
0 repeated 32 times, then 1 repeated 32 times, then 2 repeated 32 times, and so on up to 7,
but I am getting just 4,4,4,4,4,... as output.
CUDA code:
__global__ void addKernel(int *inputPointer, int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];

    sharedVariable [ threadIdx.x % 32 ] = 0;
    sharedVariable [ threadIdx.x % 32 ] = threadIdx.x / 32;

    outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
}
int main () {
    ......
    addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);
    ......
    /** Print output here */
    // I am getting 4,4,4,4,4,4,4,4,4 as output
}
Complete Code:
#include "cuda_runtime.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#define SIZE (256 * sizeof(int))
__global__ void addKernel(int *inputPointer, int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];

    sharedVariable [ threadIdx.x % 32 ] = 0;
    sharedVariable [ threadIdx.x % 32 ] = threadIdx.x / 32;

    outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
}
int main()
{
    int *inputPointer  = (int *) malloc (SIZE);
    int *outputPointer = (int *) malloc (SIZE);
    int *device_inputPointer;
    int *device_outputPointer;

    cudaMalloc((void**)&device_inputPointer, SIZE);
    cudaMalloc((void**)&device_outputPointer, SIZE);

    // Copy the input vector from host memory to the GPU buffer.
    memset (inputPointer, 0, SIZE);
    cudaMemcpy(device_inputPointer, inputPointer, SIZE, cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);

    cudaMemcpy(outputPointer, device_outputPointer, SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 256; i++) {
        printf(" %d ", outputPointer[i]);
    }

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaDeviceReset();
    getch();
    return 0;
}
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#define SIZE 256 * sizeof(int)
__global__ void addKernel(int *inputPointer, int *outputPointer)
{
__shared__ int sharedVariable[ 32 ];
sharedVariable [ threadIdx.x % 32 ] = 0;
sharedVariable [ threadIdx.x % 32 ] = threadIdx.x /32;
outputPointer[ threadIdx.x ] = sharedVariable [ threadIdx.x % 32 ];
}
int main()
{
// Copy input vectors from host memory to GPU buffers.
int *inputPointer = (int * ) malloc (SIZE);
int *outputPointer= (int * ) malloc (SIZE);
int *device_inputPointer;
int *device_outputPointer;
cudaMalloc((void**)&device_inputPointer, SIZE);
cudaMalloc((void**)&device_outputPointer, SIZE);
memset (inputPointer , 0 , SIZE);
cudaMemcpy(device_inputPointer , inputPointer, SIZE , cudaMemcpyHostToDevice);
// Launch a kernel on the GPU with one thread for each element.
addKernel<<<1, 256>>>(device_inputPointer, device_outputPointer);
cudaMemcpy(outputPointer, device_outputPointer, SIZE , cudaMemcpyDeviceToHost);
// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Nsight and Visual Profiler to show complete traces.
for ( int i = 0 ; i < 256 ; i ++ ) {
printf ( " %d " , outputPointer[i] );
}
cudaDeviceReset();
getch();
return 0;
}
I tested it on different hardware; on a Tesla K20 (Kepler architecture) it works fine.
Upvotes: 0
Views: 275
Reputation: 16796
The code has undefined behavior in the following lines:
sharedVariable [ threadIdx.x % 32 ] = 0;
sharedVariable [ threadIdx.x % 32 ] = threadIdx.x /32;
Multiple threads can have the same value of threadIdx.x % 32, and those threads try to write to the same shared memory location simultaneously. This causes a race condition between them.
For example, consider the threads with threadIdx.x equal to 0, 32, 64, 96, etc. All of these threads will try to write to index 0 of sharedVariable, causing undefined behavior. The same holds for every other group of threads whose indices differ by a multiple of 32 (in this example).
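For illustration, here is a minimal sketch of one way to make the pattern race-free: give every shared memory slot exactly one writer and separate the writes from the reads with a barrier. The kernel name addKernelRaceFree and the single-warp write policy are my own choices, not anything from the question:

__global__ void addKernelRaceFree(int *outputPointer)
{
    __shared__ int sharedVariable[ 32 ];

    // Only the first warp writes, so each slot has exactly one writer.
    if (threadIdx.x < 32)
        sharedVariable[ threadIdx.x ] = threadIdx.x;

    // No thread reads until all writes have finished and are visible.
    __syncthreads();

    outputPointer[ threadIdx.x ] = sharedVariable[ threadIdx.x % 32 ];
}

Note that this prints the repeating pattern 0,1,...,31 rather than the output the question expects; it only demonstrates how to remove the race, not how to produce warp IDs.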
Upvotes: 2
Reputation: 1809
I think it is not fully clear to you how CUDA code is executed in parallel.
The line sharedVariable [ threadIdx.x % 32 ] = 0; is completely useless, because in the next line you overwrite it with threadIdx.x / 32.
Also, your assumption about the warp count is wrong: your thread block consists of 256 threads, so there are 8 warps (256/32).
You cannot expect any particular result from your code, because its behavior is undefined! The value of threadIdx.x / 32 will be in the range 0 to 7, depending on threadIdx.x, which is in the range 0..255.
Because there are 8 warps of 32 threads each, there will be 8 writes to sharedVariable [ threadIdx.x % 32 ], and you have no control over which warp executes first and which last. In your case warp 4 happened to execute last, and therefore your results are all 4s.
To achieve the results that you expect from your kernel, it can be changed to:
__global__ void addKernel(int *outputPointer)
{
    outputPointer[ threadIdx.x ] = threadIdx.x / 32;
}
I see no sensible way to use shared memory the way you want to.
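For completeness, here is a sketch of a race-free way to route the warp ID through shared memory, if one insisted on it. The kernel name addKernelViaShared and the one-slot-per-warp layout are my own illustration, not part of the original answer, and the shared memory adds nothing over the direct computation above:

__global__ void addKernelViaShared(int *outputPointer)
{
    __shared__ int warpId[ 8 ];   // one slot per warp: 256 / 32 = 8

    // Exactly one thread per warp (lane 0) writes its warp's slot,
    // so every location has a single writer and there is no race.
    if (threadIdx.x % 32 == 0)
        warpId[ threadIdx.x / 32 ] = threadIdx.x / 32;

    __syncthreads();              // make the writes visible to all threads

    outputPointer[ threadIdx.x ] = warpId[ threadIdx.x / 32 ];
}

Launched as addKernelViaShared<<<1, 256>>>(device_outputPointer), this produces the expected output of 0 thirty-two times, then 1 thirty-two times, and so on, but it is strictly more work than computing threadIdx.x / 32 directly.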
Upvotes: 2