Leios

Reputation: 1

CUFFT_INVALID_DEVICE when creating cufft plan on HPC

I am testing the following code on my own local machines (both on Arch Linux and on Ubuntu 16.04, using NVIDIA driver 390 and CUDA 9.1) and on our local HPC clusters:

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <cufft.h>

int main(){
    // Initializing variables
    int n = 1024;
    cufftHandle plan1d;
    double2 *h_a, *d_a;

    // Allocation / definitions
    h_a = (double2 *)malloc(sizeof(double2)*n);
    for (int i = 0; i < n; ++i){
        h_a[i].x = sin(2*M_PI*i/n);
        h_a[i].y = 0;
    }

    cudaMalloc(&d_a, sizeof(double2)*n);
    cudaMemcpy(d_a, h_a, sizeof(double2)*n, cudaMemcpyHostToDevice);
    cufftResult result = cufftPlan1d(&plan1d, n, CUFFT_Z2Z, 1);

    // ignoring full error checking for readability
    if (result == CUFFT_INVALID_DEVICE){
        std::cout << "Invalid Device Error\n";
        exit(1);
    }

    // Executing FFT
    cufftExecZ2Z(plan1d, d_a, d_a, CUFFT_FORWARD);

    // Executing the iFFT
    cufftExecZ2Z(plan1d, d_a, d_a, CUFFT_INVERSE);

    // Copying back
    cudaMemcpy(h_a, d_a, sizeof(double2)*n, cudaMemcpyDeviceToHost);

    // Releasing the plan and the buffers
    cufftDestroy(plan1d);
    cudaFree(d_a);
    free(h_a);
}

I compile with nvcc cuda_test.cu -lcufft
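
Since the snippet above skips most error checks, here is a minimal sketch that checks every call, which may help show whether an earlier CUDA call (for example cudaMalloc) already fails before cufftPlan1d does. CHECK_CUDA and CHECK_CUFFT are hypothetical helper macros written for this example; they are not part of CUDA or cuFFT.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: abort with file, line and message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s:%d: CUDA error: %s\n",                \
                         __FILE__, __LINE__, cudaGetErrorString(err_));    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Hypothetical helper: abort with the numeric cufftResult if a cuFFT call fails.
#define CHECK_CUFFT(call)                                                  \
    do {                                                                   \
        cufftResult res_ = (call);                                         \
        if (res_ != CUFFT_SUCCESS) {                                       \
            std::fprintf(stderr, "%s:%d: cuFFT error code %d\n",           \
                         __FILE__, __LINE__, static_cast<int>(res_));      \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    const int n = 1024;
    cufftDoubleComplex *d_a = NULL;
    cufftHandle plan1d;

    // Allocate and zero the device buffer, checking each step.
    CHECK_CUDA(cudaMalloc(&d_a, sizeof(cufftDoubleComplex) * n));
    CHECK_CUDA(cudaMemset(d_a, 0, sizeof(cufftDoubleComplex) * n));

    // Create the plan and run a forward transform in place.
    CHECK_CUFFT(cufftPlan1d(&plan1d, n, CUFFT_Z2Z, 1));
    CHECK_CUFFT(cufftExecZ2Z(plan1d, d_a, d_a, CUFFT_FORWARD));
    CHECK_CUDA(cudaDeviceSynchronize());

    // Release the plan and the buffer.
    CHECK_CUFFT(cufftDestroy(plan1d));
    CHECK_CUDA(cudaFree(d_a));

    std::puts("all calls succeeded");
    return 0;
}
```

This should build the same way as the original (nvcc with -lcufft). If the first reported failure is a CUDA runtime error rather than the cuFFT one, that would suggest the problem sits below cuFFT.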

On both of my local machines, the code works just fine; however, I have tried using the same code on our HPC clusters and it will return the CUFFT_INVALID_DEVICE error on that hardware / configuration. Here's the hardware and driver configuration for those devices.

According to this, the CUDA versions should be fine with the driver versions available; however, I received a similar error on my local Ubuntu machine earlier, when my driver and CUDA installations were set up incorrectly.

I am completely baffled at how to continue here and can only think of a few things:

  1. There is some difference between the consumer hardware I am using on my local machines (Titan X (Pascal) and GTX 970) and the cluster HPC hardware.
  2. There is some driver configuration problem that I have not considered. I did what I could to try out different CUDA versions, but none of them seemed to work, except for 7.5.18, which returned the same error, but did not seem to affect performance.
  3. There is some change to cuFFT after CUDA 7.5.18 that I was not made aware of.
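
To help tell these possibilities apart, a small diagnostic along the following lines could be run on both the local machines and the cluster. It is only a sketch: it prints the driver, runtime and cuFFT versions that the binary actually sees, plus the visible devices and their compute capability. The driver and runtime values are encoded as 1000*major + 10*minor (so 9010 means 9.1); the cuFFT value is printed as the raw integer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0, cufftVersion = 0;

    // Latest CUDA version supported by the installed driver (0 if no driver is found).
    cudaDriverGetVersion(&driverVersion);
    // Version of the CUDA runtime this binary is using.
    cudaRuntimeGetVersion(&runtimeVersion);
    // Version of the cuFFT library that was actually loaded.
    cufftGetVersion(&cufftVersion);

    std::printf("driver supports CUDA: %d\n", driverVersion);
    std::printf("runtime version:      %d\n", runtimeVersion);
    std::printf("cuFFT version:        %d\n", cufftVersion);

    if (driverVersion < runtimeVersion) {
        std::printf("warning: the driver is older than the runtime\n");
    }

    // List the devices visible to this process.
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess) {
            std::printf("device %d: %s, compute capability %d.%d\n",
                        dev, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}
```

A runtime or cuFFT version that differs from the toolkit you believe you compiled against would point towards a library or module mix-up rather than a hardware difference.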

As a note: this is just an example, but I have a larger codebase that does not seem to run due to this error and I am trying to figure out how to solve that issue currently.

Thanks for reading and let me know if you have any ideas on how to proceed!

EDIT -- added a comment and fixed a typo in main code, after Rob's comment.

Upvotes: 0

Views: 547

Answers (1)

Igor Sfiligoi

Reputation: 11

I have had a similar problem, and it turned out to be a conflict between the Cray wrappers and the CUDA toolkit. Not loading the cudatoolkit module, enabling dynamic linking, and using the compiler-provided libraries solved the problem.

PS: I am using PGI Fortran 17.5, so not an exact match.

Upvotes: 1
