Reputation: 551
I am trying to make an LCG random number generator run in parallel using CUDA on the GPU. However, I am having trouble actually getting multiple threads to run at the same time. Here is a copy of the code:
#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont)
{
    int a = 9, c = 3, F, X = 1;        // LCG multiplier, increment and seed
    long M = 524288, Y;                // modulus
    printf("\nKernel X is %d\n", X);
    F = X;                             // remember the seed so we know when the cycle closes
    Y = X;
    printf("Kernel F is %d\nKernel Y is %ld\n", F, Y);
    Y = (a * Y + c) % M;               // one LCG step: Y = (a*Y + c) mod M
    printf("%ld\t", Y);
    while (Y != F)                     // keep generating until the sequence returns to the seed
    {
        Y = (a * Y + c) % M;
        printf("%ld\t", Y);
        cont[0]++;                     // count how many numbers were produced
    }
}
int main()
{
    long cont[1] = {1};
    int X[1];
    long *dev_cont;
    int *dev_X;

    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate(&beginEvent);
    cudaEventCreate(&endEvent);

    printf("Please give the value of the seed X ");
    scanf("%d", &X[0]);
    printf("Host X is: %d", *X);

    cudaEventRecord(beginEvent, 0);

    cudaMalloc((void**)&dev_cont, sizeof(long));
    cudaMalloc((void**)&dev_X, sizeof(int));
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_X, X, 1 * sizeof(int), cudaMemcpyHostToDevice);

    rng<<<1,1>>>(dev_cont);            // one block, one thread

    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);

    cudaEventRecord(endEvent, 0);
    cudaEventSynchronize(endEvent);
    float timevalue;
    cudaEventElapsedTime(&timevalue, beginEvent, endEvent);

    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);

    cudaFree(dev_cont);
    cudaFree(dev_X);
    cudaEventDestroy(endEvent);
    cudaEventDestroy(beginEvent);
    return 0;
}
Right now I am only sending one block with one thread. However, if I send 100 threads, all that happens is that the same number is produced 100 times before moving on to the next number. In theory this is the expected behavior, but it defeats the purpose of "random numbers" when the same number is repeated.
The idea I want to implement is to have multiple threads. One thread will use the formula Y=(a*Y+c)%M with an initial value of Y=1, another thread will use the same formula with an initial value of Y=1000, and so on. However, once the first thread has produced 1000 numbers it needs to stop, because if it continued it would interfere with the second thread, which produces its numbers starting from Y=1000.
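To make the idea concrete, here is a rough, untested sketch of what I have in mind (the kernel name, the seed array, and the numbers-per-thread count are just placeholders, not code I am actually running): each thread reads its own seed and writes only into its own slice of the output array, so the threads never touch each other's numbers.
__global__ void rng_sketch(int *seeds, int *out, int perThread, int M)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int a = 9, c = 3;
    int Y = seeds[t];                                // every thread starts from its own seed
    for (int i = 0; i < perThread; i++)
    {
        Y = (a * Y + c) % M;                         // LCG step
        out[t * perThread + i] = Y;                  // each thread writes only to its own slice
    }
}
Launching it as, say, rng_sketch<<<1, 10>>>(dev_seeds, dev_out, 1000, 524288) with one seed per thread should give every thread its own block of 1000 numbers, if I understand the indexing correctly.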
If anyone can point me in the right direction, at least towards creating multiple threads with different instructions inside them that run in parallel, I will try to figure out the rest.
Thanks!
I updated my code to the following. Basically I am trying to produce 256 random numbers. I created the array where those 256 numbers will be stored, plus an array with 10 different seed values for Y in the threads, and I changed the kernel launch to request 10 threads on the device. The generated numbers are saved into the array. The code is not working as it should. Please advise on how to fix it or how to make it achieve what I want.
Thanks!
#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{
    int Y = threadIdx.x;
    Y = N[threadIdx.x];          // Y is then overwritten with the value read from N at this thread's index
    int a = 9, c = 3, i;
    long M = 256;
    for (i = 0; i < 256; i++)
    {
        Y = (a * Y + c) % M;     // LCG step
        N[i] = Y;                // store the generated number
        cont[0]++;               // count generated numbers
    }
}
int main()
{
    long cont[1] = {1};
    int i;
    int L[10] = {1,25,50,75,100,125,150,175,200,225}, N[256];
    long *dev_cont;
    int *dev_L, *dev_N;

    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate(&beginEvent);
    cudaEventCreate(&endEvent);
    cudaEventRecord(beginEvent, 0);

    cudaMalloc((void**)&dev_cont, sizeof(long));
    cudaMalloc((void**)&dev_L, sizeof(int));
    cudaMalloc((void**)&dev_N, sizeof(int));
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);

    rng<<<1,10>>>(dev_cont, dev_L, dev_N);    // one block, 10 threads

    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
    cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventRecord(endEvent, 0);
    cudaEventSynchronize(endEvent);
    float timevalue;
    cudaEventElapsedTime(&timevalue, beginEvent, endEvent);

    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
    printf("Your numbers are:");
    for (i = 0; i < 256; i++)
    {
        printf("%d\t", N[i]);
    }

    cudaFree(dev_cont);
    cudaFree(dev_L);
    cudaFree(dev_N);
    cudaEventDestroy(endEvent);
    cudaEventDestroy(beginEvent);
    return 0;
}
@Bardia - Please let me know how I can change my code to achieve this.
I edited my code to incorporate @Bardia's modifications to the kernel. However, a few errors are showing up in the generated numbers. First, the counter I created in the kernel to count how many numbers are being generated is not working; at the end it reports that only 1 number was generated. The timer I created to measure how long the kernel takes is also not working, because it keeps displaying 0.00 ms. And based on the parameters I have set for the formula, the numbers that are generated, copied into the array and printed on the screen are nowhere near the ones that should appear. All of these used to work before.
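One thing I suspect, although I am not sure it is the cause: all of the threads increment cont[0] at the same time, so increments could be getting lost. If that is the problem, a race-free counter would look something like this (just a guess, not what my code below does):
__global__ void count_only(unsigned long long *cont)
{
    atomicAdd(cont, 1ULL);   // atomicAdd serializes the increments so none are lost
}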
Here is the new code:
#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{
    int Y = threadIdx.x;
    Y = L[threadIdx.x];                  // each thread starts from its own seed in L
    int a = 9, c = 3, i;
    long M = 256;
    int length = ceil((float)M/10);      // 256 divided by the number of threads
    for (i = (threadIdx.x * length); i < length; i++)
    {
        Y = (a * Y + c) % M;             // LCG step
        N[i] = Y;                        // store the generated number
        cont[0]++;                       // count generated numbers
    }
}
int main()
{
    long cont[1] = {1};
    int i;
    int L[10] = {1,25,50,75,100,125,150,175,200,225}, N[256];
    long *dev_cont;
    int *dev_L, *dev_N;

    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate(&beginEvent);
    cudaEventCreate(&endEvent);
    cudaEventRecord(beginEvent, 0);

    cudaMalloc((void**)&dev_cont, sizeof(long));
    cudaMalloc((void**)&dev_L, sizeof(int));
    cudaMalloc((void**)&dev_N, sizeof(int));
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);

    rng<<<1,10>>>(dev_cont, dev_L, dev_N);    // one block, 10 threads

    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
    cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventRecord(endEvent, 0);
    cudaEventSynchronize(endEvent);
    float timevalue;
    cudaEventElapsedTime(&timevalue, beginEvent, endEvent);

    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
    printf("Your numbers are:");
    for (i = 0; i < 256; i++)
    {
        printf("%d\t", N[i]);
    }

    cudaFree(dev_cont);
    cudaFree(dev_L);
    cudaFree(dev_N);
    cudaEventDestroy(endEvent);
    cudaEventDestroy(beginEvent);
    return 0;
}
This is the output I receive:
[wigberto@client2 CUDA]$ ./RNG8
You generated a total of 1 numbers
CUDA Kernel Time: 0.00 ms
Your numbers are:614350480 32767 1132936976 11079 2 0 10 0 1293351837 0 -161443660 48 0 0 614350336 32767 1293351836 0 -161444681 48 614350760 32767 1132936976 11079 2 0 10 0 1057178751 0 -161443660 48 155289096 49 614350416 32767 1057178750 0 614350816 32767 614350840 32767 155210544 49 0 0 1132937352 11079 1130370784 11079 1130382061 11079 155289096 49 1130376992 11079 0 1 1610 1 1 1 1130370408 11079 614350896 32767 614350816 32767 1057178751 0 614350840 32767 0 0 -161443150 48 0 0 1132937352 11079 1 11079 0 0 1 0 614351008 32767 614351032 32767 0 0 0 0 0 0 1130369536 1 1132937352 11079 1130370400 11079 614350944 32767 1130369536 11079 1130382061 11079 1130370784 11079 1130365792 11079 6143510880 614351008 32767 -920274837 0 614351032 32767 0 0 -161443150 48 0 0 0 0 1 0 128 0-153802168 48 614350896 32767 1132839104 11079 97 0 88 0 1 0 155249184 49 1130370784 11079 0 0-1 0 1130364928 11079 2464624 0 4198536 0 4198536 0 4197546 0 372297808 0 1130373120 11079 -161427611 48 111079 0 0 1 0 -153802272 48 155249184 49 372297840 0 -1 0 -161404446 48 0 0 0 0372298000 0 372297896 0 372297984 0 0 0 0 0 1130369536 11079 84 0 1130471067 11079 6303744 0614351656 32767 0 0 -1 0 4198536 0 4198536 0 4197546 0 1130397880 11079 0 0 0 0 0 0 00 0 0 -161404446 48 0 0 4198536 0 4198536 0 6303744 0 614351280 32767 6303744 0 614351656 32767 614351640 32767 1 0 4197371 0 0 0 0 0 [wigberto@client2 CUDA]$
@Bardia - Please advise on the best way to proceed here.
Thanks!
Upvotes: 0
Views: 1012
Reputation: 393
This answer relates to the edited part of the question.
I didn't notice that it is a recursive algorithm, and unfortunately I don't know how to parallelize a recursive algorithm.
My only idea for generating these 256 numbers is to generate them in separate chunks, i.e. generate 26 of them in the first thread, 26 in the second thread, and so on. The following code will do that (this is only the kernel part):
#include <iostream>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{
    int Y = threadIdx.x;
    Y = L[threadIdx.x];
    int a = 9, c = 3, i;
    long M = 256;
    int length = ceil((float)M/10);      // 256 divided by the number of threads
    for (i = (threadIdx.x * length); i < length; i++)
    {
        Y = (a * Y + c) % M;
        N[i] = Y;
        cont[0]++;
    }
}
Upvotes: 0
Reputation: 393
It does the same work in all threads because you ask them all to do the same work. You should always distinguish the threads from each other by addressing them.
For example, you should say: thread #1, you do this part of the job and save your work here; thread #2, you do that part of the job and save your work there; and then go back to the host and use that data.
For a two-dimensional grid of blocks with two-dimensional blocks of threads, I use this code for addressing:
int X = blockIdx.x*blockDim.x+threadIdx.x;
int Y = blockIdx.y*blockDim.y+threadIdx.y;
The X and Y in the code above are the global addresses of your thread (I think for your case a one-dimensional grid with one-dimensional blocks is sufficient).
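For example, a one-dimensional sketch of that idea could look like the kernel below (the kernel and array names are only illustrative): every thread computes its global address once and then reads and writes only the element that belongs to it.
__global__ void work(const int *in, int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D address of this thread
    out[idx] = (9 * in[idx] + 3) % 256;                // this thread touches only its own element
}
Each thread applies one step of your formula to its own element, so no two threads ever write to the same place.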
Also remember that on older GPUs (compute capability below 2.0) you cannot use the printf function inside a kernel; the GPU cannot make any interrupt. In that case you can use the cuPrintf function, which ships as one of the CUDA SDK samples, but read its instructions to use it correctly.
Upvotes: 0
Reputation:
You can address threads within a block through the threadIdx variable.
I.e., in your case you should probably set Y = threadIdx.x and then use Y=(a*Y+c)%M.
But in general, implementing a good RNG in CUDA can be really difficult, so I don't know whether you want to implement your own generator just for practice.
Otherwise the CURAND library is available, which provides a number of pseudo- and quasi-random generators, e.g. XORWOW, Mersenne Twister, Sobol, etc.
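For example, with the CURAND device API each thread can own its generator state; a minimal sketch (the kernel and variable names are mine, not taken from the CURAND documentation) looks roughly like this:
#include <curand_kernel.h>

__global__ void draw(unsigned int *out, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);   // same seed, a different subsequence per thread
    out[id] = curand(&state);           // one pseudo-random 32-bit number per thread
}
Every thread gets a statistically independent subsequence because curand_init is called with the thread's own id as the subsequence number.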
Upvotes: 2