DoaJC_Blogger

Reputation: 21

CUDA kernel returns nothing

I'm using CUDA Toolkit 8 with Visual Studio Community 2015. When I try the simple vector addition example from NVIDIA's PDF manual (minus the error checking, because I don't have the header files for it), the output array always comes back with undefined values, which means it was never filled. When I pre-fill it with 0s, that's all I get back at the end.

Others have had this problem, and some people say it's caused by compiling for the wrong compute capability. However, I am using an NVIDIA GTX 750 Ti, which is supposed to be Compute Capability 5.0. I have tried compiling for both Compute Capability 2.0 (the minimum for my SDK) and 5.0.

I also cannot get any of the precompiled examples to work. For example, vectoradd.exe says, "Failed to allocate device vector A (error code initialization error)!" and oceanfft.exe says, "Error unable to find GLSL vertex and fragment shaders!", which doesn't make sense because GLSL and fragment shading are very basic features.

My driver version is 361.43 and other apps such as Blender Cycles in CUDA mode and Stellarium work perfectly.

Here is the code that should work:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <algorithm>
#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    // allocate the memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
    // display the results
    for (int i = 0; i<N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

I'm trying to develop CUDA apps so any help would be greatly appreciated.

Upvotes: 2

Views: 845

Answers (1)

talonmies

Reputation: 72352

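This was apparently caused by a driver version that is incompatible with the CUDA 8 toolkit. Installing the driver distributed with the version 8 toolkit solved the problem. Because the posted code never checks the return value of any CUDA call, the failing runtime calls would have gone unnoticed, which is consistent with the output array never being written.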
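A driver/toolkit mismatch like this is easier to spot when every runtime call is checked. The following is a minimal sketch, not part of the original question or comments: `CHECK_CUDA`, `versionMajor`, `versionMinor`, and the `noop` kernel are hypothetical names introduced for illustration. It relies only on `cuda_runtime.h`, so it does not need the helper headers from the SDK samples. It compares the CUDA version the installed driver supports with the version of the runtime the program was built against, then checks one allocation and one kernel launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): stop with a
// readable message as soon as a CUDA runtime call returns an error.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// CUDA encodes versions as 1000 * major + 10 * minor (8000 means 8.0).
static int versionMajor(int v) { return v / 1000; }
static int versionMinor(int v) { return (v % 1000) / 10; }

// Empty kernel, used only to confirm that a launch succeeds at all.
__global__ void noop() {}

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // Latest CUDA version supported by the installed driver (0 if none).
    CHECK_CUDA(cudaDriverGetVersion(&driverVersion));
    // CUDA version of the runtime library this program is using.
    CHECK_CUDA(cudaRuntimeGetVersion(&runtimeVersion));
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           versionMajor(driverVersion), versionMinor(driverVersion),
           versionMajor(runtimeVersion), versionMinor(runtimeVersion));
    if (driverVersion < runtimeVersion) {
        fprintf(stderr, "The driver is older than the runtime; "
                        "a driver update is probably needed.\n");
        return EXIT_FAILURE;
    }

    int *dev = nullptr;
    CHECK_CUDA(cudaMalloc((void**)&dev, sizeof(int)));
    noop<<<1, 1>>>();
    CHECK_CUDA(cudaGetLastError());       // reports launch failures
    CHECK_CUDA(cudaDeviceSynchronize());  // reports failures during execution
    CHECK_CUDA(cudaFree(dev));
    return 0;
}
```

Wrapping the `cudaMalloc`, `cudaMemcpy`, and kernel launch calls of the vector addition program in the same way should make it report the underlying error instead of silently printing an unfilled array.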

[Answer assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]

Upvotes: 1
