Wilo Maldonado

Reputation: 551

CUDA - Multiple Threads

I am trying to make an LCG random number generator run in parallel using CUDA on the GPU. However, I am having trouble actually getting multiple threads to run at the same time. Here is a copy of the code:

#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont)
{

    int a=9, c=3, F, X=1; 
    long M=524288, Y;     
    printf("\nKernel X is %d\n", X);     
    F=X;
    Y=X;
    printf("Kernel F is %d\nKernel Y is %ld\n", F, Y);
    Y=(a*Y+c)%M;
    printf("%ld\t", Y);
    while(Y!=F)
    {
        Y=(a*Y+c)%M;
        printf("%ld\t", Y);
    cont[0]++;
    }
}
int main()
{
    long cont[1]={1};
    int X[1];
    long *dev_cont;
    int *dev_X;
    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate( &beginEvent );
    cudaEventCreate( &endEvent );
    printf("Please give the value of the seed X ");
    scanf("%d", &X[0]);
    printf("Host X is: %d", *X);
    cudaEventRecord( beginEvent, 0);
    cudaMalloc( (void**)&dev_cont, sizeof(long) );
    cudaMalloc( (void**)&dev_X, sizeof(int) );
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_X, X, 1 * sizeof(int), cudaMemcpyHostToDevice);
    rng<<<1,1>>>(dev_cont);
    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
    cudaEventRecord( endEvent, 0);
    cudaEventSynchronize (endEvent );
    float timevalue;
    cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
    cudaFree(dev_cont);
    cudaFree(dev_X);
    cudaEventDestroy( endEvent );
    cudaEventDestroy( beginEvent );
    return 0;
}

Right now I am only sending one block with one thread. However, if I send 100 threads, the only thing that happens is that it produces the same number 100 times and then proceeds to the next number. In theory this is what should be expected, but it defeats the purpose of "random numbers" when a number is repeated.

The idea I want to implement is to have multiple threads. One thread will use the formula Y=(a*Y+c)%M with an initial value of Y=1, another thread will use the same formula with an initial value of Y=1000, and so on. However, once the first thread produces 1000 numbers, it needs to stop making more calculations, because if it continues it will interfere with the second thread, which starts from Y=1000.

If anyone can point me in the right direction, at least on how to create multiple threads with different functions or instructions inside them running in parallel, I will try to figure out the rest.

Thanks!

UPDATE: July 31, 8:14PM EST

I updated my code to the following. Basically I am trying to produce 256 random numbers. I created the array where those 256 numbers will be stored, and an array with 10 different seed values for the values of Y in the threads. I also changed the code to launch 10 threads on the device, and I am saving the generated numbers in an array. The code is not working as it should. Please advise on how to fix it or how to make it achieve what I want.

Thanks!

#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{

    int Y=threadIdx.x;
    Y=N[threadIdx.x];
    int a=9, c=3, i;
    long M=256;
    for(i=0;i<256;i++)
    {
        Y=(a*Y+c)%M;
        N[i]=Y;
        cont[0]++;
    }
}
int main()
{
    long cont[1]={1};
    int i;
    int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
    long *dev_cont;
    int *dev_L, *dev_N;
    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate( &beginEvent );
    cudaEventCreate( &endEvent );
    cudaEventRecord( beginEvent, 0);
    cudaMalloc( (void**)&dev_cont, sizeof(long) );
    cudaMalloc( (void**)&dev_L, 10 * sizeof(int) );   // room for all 10 seeds
    cudaMalloc( (void**)&dev_N, 256 * sizeof(int) );  // room for all 256 outputs
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
    rng<<<1,10>>>(dev_cont, dev_L, dev_N);
    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
    cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord( endEvent, 0);
    cudaEventSynchronize (endEvent );
    float timevalue;
    cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
    printf("Your numbers are:");
    for(i=0;i<256;i++)
    {
        printf("%d\t", N[i]);
    }
    cudaFree(dev_cont);
    cudaFree(dev_L);
    cudaFree(dev_N);
    cudaEventDestroy( endEvent );
    cudaEventDestroy( beginEvent );
    return 0;
}

@Bardia - Please let me know how I can change my code to accommodate my needs.

UPDATE: August 1, 5:39PM EST

I edited my code to accommodate @Bardia's modifications to the kernel code. However, a few errors are showing up in the number generation. First, the counter I created in the kernel to count how many numbers are generated is not working; at the end it reports that only "1" number was generated. The timer I created to measure the kernel's execution time is also not working, because it keeps displaying 0.00 ms. And based on the parameters I set for the formula, the numbers that are generated, copied into the array, and printed on screen are not the numbers that should appear (or even close). All of these used to work before.

Here is the new code:

#include <iostream>
#include <cstdio>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{

    int Y=threadIdx.x;
    Y=L[threadIdx.x];
    int a=9, c=3, i;
    long M=256;
    int length=ceil((float)M/10); //256 divided by the number of threads.
    for(i=(threadIdx.x*length);i<length;i++)
    {
        Y=(a*Y+c)%M;
        N[i]=Y;
        cont[0]++;
    }
}
int main()
{
    long cont[1]={1};
    int i;
    int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
    long *dev_cont;
    int *dev_L, *dev_N;
    cudaEvent_t beginEvent;
    cudaEvent_t endEvent;
    cudaEventCreate( &beginEvent );
    cudaEventCreate( &endEvent );
    cudaEventRecord( beginEvent, 0);
    cudaMalloc( (void**)&dev_cont, sizeof(long) );
    cudaMalloc( (void**)&dev_L, 10 * sizeof(int) );   // room for all 10 seeds
    cudaMalloc( (void**)&dev_N, 256 * sizeof(int) );  // room for all 256 outputs
    cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
    rng<<<1,10>>>(dev_cont, dev_L, dev_N);
    cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
    cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord( endEvent, 0);
    cudaEventSynchronize (endEvent );
    float timevalue;
    cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
    printf("\n\nYou generated a total of %ld numbers", cont[0]);
    printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
    printf("Your numbers are:");
    for(i=0;i<256;i++)
    {
        printf("%d\t", N[i]);
    }
    cudaFree(dev_cont);
    cudaFree(dev_L);
    cudaFree(dev_N);
    cudaEventDestroy( endEvent );
    cudaEventDestroy( beginEvent );
    return 0;
}

This is the output I receive:

[wigberto@client2 CUDA]$ ./RNG8


You generated a total of 1 numbers
CUDA Kernel Time: 0.00 ms
Your numbers are:614350480      32767   1132936976      11079   2       0       10      0       1293351837      0       -161443660      48      0       0       614350336       32767    1293351836      0       -161444681      48      614350760       32767   1132936976      11079   2       0       10      0       1057178751      0       -161443660      48       155289096       49      614350416       32767   1057178750      0       614350816       32767   614350840       32767   155210544       49      0       0       1132937352       11079   1130370784      11079   1130382061      11079   155289096       49      1130376992      11079   0       1       1610    1       1       1       1130370408      11079    614350896       32767   614350816       32767   1057178751      0       614350840       32767   0       0       -161443150      48      0       0       1132937352      11079    1       11079   0       0       1       0       614351008       32767   614351032       32767   0       0       0       0       0       0       1130369536      1       1132937352       11079   1130370400      11079   614350944       32767   1130369536      11079   1130382061      11079   1130370784      11079   1130365792      11079   6143510880       614351008       32767   -920274837      0       614351032       32767   0       0       -161443150      48      0       0       0       0       1       0       128     0-153802168      48      614350896       32767   1132839104      11079   97      0       88      0       1       0       155249184       49      1130370784      11079   0       0-1      0       1130364928      11079   2464624 0       4198536 0       4198536 0       4197546 0       372297808       0       1130373120      11079   -161427611      48      111079   0       0       1       0       -153802272      48      155249184       49      372297840       0       -1      0       -161404446      48      0       0       0       0372298000       0       372297896       0       
372297984       0       0       0       0       0       1130369536      11079   84      0       1130471067      11079   6303744 0614351656       32767   0       0       -1      0       4198536 0       4198536 0       4197546 0       1130397880      11079   0       0       0       0       0       0       00       0       0       -161404446      48      0       0       4198536 0       4198536 0       6303744 0       614351280       32767   6303744 0       614351656       32767   614351640        32767   1       0       4197371 0       0       0       0       0       [wigberto@client2 CUDA]$

@Bardia - Please advise on what is the best thing to do here.

Thanks!

Upvotes: 0

Views: 1012

Answers (3)

Bardia

Reputation: 393

This answer relates to the edited part of the question.

I didn't notice that it is a recursive algorithm, and unfortunately I don't know how to parallelize a recursive algorithm.

My only idea for generating these 256 numbers is to generate them separately, i.e. generate 26 of them in the first thread, 26 in the second thread, and so on. This code will do that (this is only the kernel part):

#include <iostream>
#include <math.h>

__global__ void rng(long *cont, int *L, int *N)
{

    int Y=L[threadIdx.x];                 // each thread starts from its own seed
    int a=9, c=3, i;
    long M=256;
    int length=ceil((float)M/10);         // 256 divided by the number of threads, rounded up
    int start=threadIdx.x*length;
    int end=start+length;
    if(end>M) end=M;                      // clamp the last thread's slice
    for(i=start;i<end;i++)
    {
        Y=(a*Y+c)%M;
        N[i]=Y;
        cont[0]++;  // note: racy across threads; use an atomic add for an exact count
    }
}

Upvotes: 0

Bardia

Reputation: 393

It does the same work in all threads because you told them all to do the same work. You should always distinguish threads from each other by addressing them.

For example, you tell thread #1 to do this job and save its work here, and thread #2 to do that job and save its work there, and then go back to the host and use that data.

For a two-dimensional grid of blocks with two-dimensional threads in each block, I use this code for addressing:

int X = blockIdx.x*blockDim.x+threadIdx.x;
int Y = blockIdx.y*blockDim.y+threadIdx.y;

The X and Y in the code above are the global address of your thread (I think for your case a one-dimensional grid with one-dimensional blocks is sufficient).

Also remember that on older GPUs you cannot use the printf function in a kernel (device-side printf requires compute capability 2.0 or higher); the GPU cannot raise an interrupt. As an alternative you can use the cuPrintf function, which is one of the CUDA SDK samples, but read its instructions to use it correctly.

Upvotes: 0

user1545642

Reputation:

You can address threads within a block by the threadIdx variable, i.e. in your case you should probably set

Y = threadIdx.x and then use Y=(a*Y+c)%M

But in general, implementing a good RNG on CUDA can be really difficult, so I don't know if you want to implement your own generator just for practice.

Otherwise there is the CURAND library available, which provides a number of pseudo- and quasi-random generators, e.g. XORWOW, Mersenne Twister, Sobol, etc.
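For reference, a minimal host-API sketch of CURAND looks roughly like the following. It needs a CUDA-capable GPU and linking with `-lcurand` to run; the generator family (XORWOW) and the seed are arbitrary choices for illustration:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    const size_t n = 256;
    float *dev_data, host_data[256];

    cudaMalloc((void**)&dev_data, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW); // one of the families mentioned above
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);      // arbitrary seed
    curandGenerateUniform(gen, dev_data, n);               // n uniform floats in (0, 1]

    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("first value: %f\n", host_data[0]);

    curandDestroyGenerator(gen);
    cudaFree(dev_data);
    return 0;
}
```

Compared with a hand-rolled LCG, CURAND takes care of stream partitioning across threads for you, which is exactly the part that is hard to get right by hand.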

Upvotes: 2
