Leonard Chung

Reputation: 161

CUDA error message : invalid configuration argument

I'm new to CUDA. As a simple exercise to get familiar with it, I wrote a little program that finds prime numbers. It almost works, but there is a problem I cannot solve: the largest number I can search up to is 1027. If I enter anything above 1027, I get this error message:

getPrimeKernel launch failed!!: invalid configuration argument
findPrimeWithCuda failed!!

Where should I adjust my code? Thank you.

Here is my code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <vector>

using namespace std;

cudaError_t findPrimeWithCuda(bool *c, int *a, unsigned int size);

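// Note: despite its name, checkPrime returns true when i has a divisor
// (i.e. when i is NOT prime); the host code below counts the entries that
// come back false as primes.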
__host__ __device__ bool checkPrime(int i)
{
    for (int m = 2; m <= i - 1; m++)
    {
        if (i%m == 0) return true;
    }
    return false;
}

__global__ void getPrimeKernel(bool *c, int *a)
{
    int i = threadIdx.x;
    c[i] = checkPrime(a[i]);
}

void cudaGetPrime(int i)
{
    i = i - 3;
    int *arr = (int *)malloc((size_t)(i * sizeof(int)));
    bool *rst = (bool *)malloc((size_t)(i * sizeof(bool)));
    for (int j = 0; j < i; j++) arr[j] = j + 3;
    cudaError_t cudaStatus = findPrimeWithCuda(rst, arr, i);
    if (cudaStatus != cudaSuccess) fprintf(stderr,"findPrimeWithCuda failed!!");
}

void w_CudaArray(int lastNum)
{
    time_t t1 = time(NULL);
    cudaGetPrime(lastNum);
    time_t t2 = time(NULL);
    printf("Time to spent : %d second\n", t2 - t1);
    cout << "Computing with CUDA to count the prime numbers ends!!" << endl << endl;
}

int main()
{
    int lastNum = 0;
    cout << "The final number which you want to find the prime numbers : ";
    cin >> lastNum;
    w_CudaArray(lastNum);
}

cudaError_t findPrimeWithCuda(bool *c, int *a, unsigned int size)
{
    int *dev_a = 0;
    bool *dev_c = 0;
    cudaError_t cudaStatus;

    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaSetDevice failed!!");
        goto Error;
    }

    size_t totalm, freem;
    float free_m, total_m, used_m;
    cudaMemGetInfo(&freem, &totalm);
    free_m = (size_t)freem / 1048576.0;
    total_m = (size_t)totalm / 1048576.0;
    used_m = total_m - free_m;
    cout << "Total memory = " << total_m << " MB" << endl;
    cout << "Used memory  = " << used_m << " MB" << endl;
    cout << "Free memory  = " << free_m << " MB" << endl;

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc dev_a failed!!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(bool));
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc dev_c failed!!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemcpy dev_a failed!!");
        goto Error;
    }

    getPrimeKernel<<<1, size>>>(dev_c, dev_a);

    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "getPrimeKernel launch failed!!: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching getPrimeKernel!\n", cudaStatus);
        goto Error;
    }

    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(bool), cudaMemcpyDeviceToHost);

    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaMemory failed!");
        goto Error;
    }

    int trueNumber = 0;
    for (int i = 0; i < size; i++)
    {
        if (c[i] == false) trueNumber++;
    }

    cout << "There are " << trueNumber + 2 << " prime numbers!!" << endl;

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "cudaDeviceReset failed!!");
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);

    return cudaStatus;
}

Upvotes: 0

Views: 6285

Answers (2)

Leonard Chung

Reputation: 161

I modified some parts of my code and the problem was solved. I changed the kernel to this:

__global__ void getPrimeKernel(bool *c, int *a, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= size) return;
    c[i] = checkPrime(a[i]);
}

and added this to the host code:

cudaDeviceProp myCUDA;
if (cudaGetDeviceProperties(&myCUDA, 0) == cudaSuccess)
{
    printf("Using device %d:\n", 0);
    printf("%s; global mem: %dB; compute v%d.%d; clock: %d kHz\n",
        myCUDA.name, (int)myCUDA.totalGlobalMem, (int)myCUDA.major,
        (int)myCUDA.minor, (int)myCUDA.clockRate);
}

int threadsPerBlock = myCUDA.maxThreadsPerBlock;
int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
cout << "Maxium number per block = " << threadsPerBlock << endl;
cout << "Blocks per Grid = " << blocksPerGrid << endl;

getPrimeKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a, size);

Now it can handle numbers larger than 1027. I hope this piece of code helps others who are new to CUDA like me. :)
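
For completeness, here is a minimal, self-contained sketch of how the bounds-checked kernel and the computed launch configuration fit together (the helper name hasDivisor, the example size of 5000, and the hard-coded device 0 are illustrative choices for this sketch, not taken verbatim from my program above):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper mirroring checkPrime above: returns true when n has a
// divisor in [2, n-1], i.e. when n is composite.
__host__ __device__ bool hasDivisor(int n)
{
    for (int m = 2; m < n; m++)
    {
        if (n % m == 0) return true;
    }
    return false;
}

__global__ void getPrimeKernel(bool *c, const int *a, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i >= size) return;                          // guard for the padded last block
    c[i] = hasDivisor(a[i]);
}

int main()
{
    const int size = 5000;                          // example size, larger than 1024
    int *a = (int *)malloc(size * sizeof(int));
    bool *c = (bool *)malloc(size * sizeof(bool));
    for (int j = 0; j < size; j++) a[j] = j + 3;    // candidates 3, 4, 5, ...

    int *dev_a = 0;
    bool *dev_c = 0;
    cudaMalloc((void **)&dev_a, size * sizeof(int));
    cudaMalloc((void **)&dev_c, size * sizeof(bool));
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0 assumed
    int threadsPerBlock = prop.maxThreadsPerBlock;  // e.g. 1024 on many devices
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

    getPrimeKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a, size);
    cudaError_t cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
        fprintf(stderr, "getPrimeKernel launch failed!!: %s\n", cudaGetErrorString(cudaStatus));
    cudaDeviceSynchronize();

    cudaMemcpy(c, dev_c, size * sizeof(bool), cudaMemcpyDeviceToHost);

    int primeCount = 0;
    for (int j = 0; j < size; j++)
    {
        if (!c[j]) primeCount++;                    // false means no divisor was found
    }
    printf("%d of the %d candidates are prime\n", primeCount, size);

    cudaFree(dev_a);
    cudaFree(dev_c);
    free(a);
    free(c);
    return 0;
}

The bounds check in the kernel is what makes this safe, because blocksPerGrid * threadsPerBlock is usually a little larger than size.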

Upvotes: 0

Marco13

Reputation: 54611

You're calling the kernel with size as the number of threads per block (see the Programming Guide section on Kernel invocation syntax). But the maximum number of threads per block is limited, depending on the Compute Capability of the device. You probably have a device with a maximum number of 1024 threads per block. So with a larger number (1027 in your case), it will no longer work. The maximum number of threads per block, depending on the device, is listed in this table on Wikipedia.

You can query the maximum number of threads per block with cudaGetDeviceProperties, and look at the maxThreadsPerBlock field.
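
A minimal query could look like this (assuming device 0, with error handling reduced to the return-code check):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)  // device 0 assumed
    {
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}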

In order to handle input sizes that are larger than the number of threads per block, you need ... more blocks! This means that you have to use a larger grid size as well. So you'll have to compute the appropriate grid- and block size for your input, and pass these as parameters to the kernel launch.

For example, like the following:

int inputSize = ...; // The size of the input data
int threadsPerBlock = 256; // May be queried from the device properties
int blocksPerGrid = (inputSize + threadsPerBlock - 1) / threadsPerBlock;

callKernel<<<blocksPerGrid, threadsPerBlock>>>(...);

This only shows the basic idea. For details, you should refer to the CUDA Programming Guide or the available samples. For example, you also have to use the global thread index, and make sure that you do not access invalid memory regions. This can be achieved by passing the length of the input array to the kernel, and check whether you're still in the bounds:

__global__ void callKernel(int *array, int arrayLength) {
    // Make sure to compute the GLOBAL thread index like this:
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= arrayLength) {
        return;
    }
    ....
}

(An aside: For practicing, anything may be fine, but I'm not sure whether prime number finding in this form is really well-suited for CUDA. You might consider having a look at the more "classic" CUDA examples, like vector addition or matrix multiplication...)
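
For reference, the heart of such a vector-addition example is just a kernel along the following lines (addKernel is a placeholder name):

// Adds two vectors element-wise, one thread per output element.
__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < n)
    {
        c[i] = a[i] + b[i];
    }
}

It would be launched with the same kind of grid and block size computation shown above.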

Upvotes: 3
