Alankrit

Reputation: 748

CUDA reports "invalid argument" error when repeating a matrix multiplication N times

I am trying to multiply matrix A with matrix B N times. I use a kernel for the matrix multiplication and streams to run the multiplication N times. I have 3 conditions to test consecutively. The 1st condition runs successfully.

I don't know why it shows an "invalid argument" error in the iteration for the second condition. My guess is that I am not cleaning up my memory properly. I have done my best to free all host and device variables. I also tried a CUDA device reset, but nothing helps. Can anyone help me debug this?

Please find the portion of my code here:

int main(){
    
    
    for (int i = 0; i < 3; i++) {
        
      
      for (int ind = 0; ind < itr; ind++){
          cudaStreamCreate(&(stream[ind]));
      }
      cudaCheckErrors("cudaStreamCreate fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind]);
      }
      cudaDeviceSynchronize();

      for (int ind = 0; ind < itr; ind++){
          // Launch our kernel
          matrixMul<<<BLOCKS, THREADS, 0, stream[ind]>>>(d_a[ind], b, d_c[ind], M, K, N);
      }
      cudaDeviceSynchronize();
      cudaCheckErrors("kernel fail");

      for (int ind = 0; ind < itr; ind++){
          cudaMemcpyAsync(h_c[ind], d_c[ind], bytes_c, cudaMemcpyDeviceToHost, stream[ind]);
      }

      for (int ind = 0; ind < itr; ind++){
          cudaStreamSynchronize(stream[ind]);
      }
        
      cudaEventRecord( stop, 0 );
      cudaEventSynchronize( stop );

      cudaEventDestroy( start );
      cudaEventDestroy( stop);

      // Free allocated memory ****The issue was here.******
      cudaFreeHost(h_a);
      cudaFree(b);
      cudaFreeHost(h_c);
      cudaFree(d_a);
      cudaFree(d_c);
      cudaDeviceReset();
    }

    return 0;
}

In the second iteration I got this error:

Fatal error: cudaStreamCreate fail (invalid argument at /tmp/tmpwgpzgk9m/73a7502c-7662-4e80-804e-4debff15dc45.cu:140)
*** FAILED - ABORTING

Solved:

The error was caused by a memory leak. I was allocating an array of pointers but only freeing the first one. As Robert suggests in the answer below, the memory must be freed for each index of the array. Also, always use proper error checking in CUDA.

Upvotes: 0

Views: 916

Answers (1)

Robert Crovella

Reputation: 152143

Suggestion: Implement proper CUDA error checking. Use it on every CUDA call. Your haphazard use of the error checking macro makes for confusing output that seems to suggest a problem with stream creation.
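As a rough sketch of what per-call checking could look like: the names `cudaOk`, `CUDA_CHECK`, and `CUDA_CHECK_KERNEL` below are made up for this illustration and are not the `cudaCheckErrors` macro from the question. The idea is that each runtime call is wrapped, so a failure is reported at the call that produced it rather than at some later check.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints a message when a
// CUDA runtime call fails and returns true only when the call succeeded.
inline bool cudaOk(cudaError_t err, const char *expr, const char *file, int line) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "CUDA error: %s returned \"%s\" at %s:%d\n",
            expr, cudaGetErrorString(err), file, line);
    return false;
}

// Wrap every runtime call; abort at the first failure.
//   CUDA_CHECK(cudaMalloc((void **)&d_ptr, bytes));
//   CUDA_CHECK(cudaFree(d_ptr));
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        if (!cudaOk((call), #call, __FILE__, __LINE__))       \
            exit(EXIT_FAILURE);                               \
    } while (0)

// Kernel launches return no status, so check the launch error and then
// synchronize to pick up errors raised while the kernel was running.
//   myKernel<<<blocks, threads>>>(args);
//   CUDA_CHECK_KERNEL();
#define CUDA_CHECK_KERNEL()                                   \
    do {                                                      \
        CUDA_CHECK(cudaGetLastError());                       \
        CUDA_CHECK(cudaDeviceSynchronize());                  \
    } while (0)
```

With a wrapper like this around each call, a bad free should be reported on the line of the free itself, not at the next place where the error state happens to be inspected.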

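Stream creation is not the problem. The invalid argument error arises from your freeing operations at the end of the loop; the next error check, which happens to follow stream creation in the second iteration, is simply the first place it gets reported. You have a number of errors: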

  1. You don't use cudaFreeHost on a pointer returned by malloc, or on a pointer that is actually a stack array.
  2. You don't use cudaFree on a pointer that is actually a stack array.
  3. If you have done allocations in a loop, you are likely going to have to do free operations in a loop.
  4. Even with your use of cudaDeviceReset (which frees all device allocations anyway), you have a memory leak because of improper freeing of the malloc allocations.

By modifying the end of your code as follows:

  ...
  cudaEventDestroy( start );
  cudaEventDestroy( stop);

  for (int ind = 0; ind < itr; ind++){
      free(h_a[ind]);
      free(h_c[ind]);
      cudaFree(d_a[ind]);
      cudaFree(d_c[ind]);
  }
  // Free allocated memory
  cudaFree(b);
  cudaDeviceReset();
}
...

I was able to make the above errors disappear.
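The allocation side is not shown in the question, so the following is only a sketch under assumptions: the pointer arrays are stack arrays, host buffers come from `malloc`, and device buffers come from `cudaMalloc`. The function name `allocAndFreePairs` and the constant `kCount` are hypothetical. It illustrates the pairing rule: `malloc` with `free`, `cudaMalloc` with `cudaFree`, one free per allocation, and no free at all for the stack arrays themselves.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

const int kCount = 4;  // hypothetical number of buffer pairs (plays the role of itr)

// Allocates kCount host/device buffer pairs of `bytes` each, then releases
// every buffer with the function that matches its allocator.
// Returns the number of calls that failed (0 means everything paired up).
int allocAndFreePairs(size_t bytes) {
    int failures = 0;
    float *h_buf[kCount];  // stack arrays of pointers: never passed to
    float *d_buf[kCount];  // free, cudaFree, or cudaFreeHost themselves

    for (int i = 0; i < kCount; i++) {
        h_buf[i] = (float *)malloc(bytes);           // host: malloc ...
        if (h_buf[i] == nullptr) failures++;
        d_buf[i] = nullptr;
        if (cudaMalloc((void **)&d_buf[i], bytes) != cudaSuccess) {
            d_buf[i] = nullptr;                      // keep the later free safe
            failures++;
        }
    }

    for (int i = 0; i < kCount; i++) {
        free(h_buf[i]);                              // ... pairs with free
        if (cudaFree(d_buf[i]) != cudaSuccess)       // cudaMalloc pairs with cudaFree
            failures++;
    }
    return failures;
}
```

If the host buffers were instead allocated with `cudaMallocHost` (pinned memory), `cudaFreeHost(h_buf[i])` would be the matching call inside the second loop.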

As an aside, it should not be necessary to create 5000 streams, but it appears to work so I'll leave it at that. I would normally advise stream reuse.

Stream reuse could look something like this. Instead of creating 5000 streams, pick a smaller number, like 5. The exact number shouldn't matter much here; it's likely that anything in the range of 3 or greater will behave similarly.

  1. Create that many streams:

       const int max_streams = 5;
       for (int ind = 0; ind < max_streams; ind++){
           cudaStreamCreate(&(stream[ind]));
       }
    
  2. When it comes to using the streams, use modulo arithmetic to "rotate" through the streams:

     for (int ind = 0; ind < itr; ind++){
         cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind%max_streams]);
     }
     cudaDeviceSynchronize();

     for (int ind = 0; ind < itr; ind++){
         // Launch our kernel
         matrixMul<<<BLOCKS, THREADS, 0, stream[ind%max_streams]>>>(d_a[ind], b, d_c[ind], M, K, N);
     }
     cudaDeviceSynchronize();
    ...
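Putting the two steps together, a self-contained sketch of the full stream lifecycle might look like the following. The kernel `doubleAll` is only a placeholder standing in for `matrixMul`, and `streamSlot`, `runWithStreamPool`, and `kMaxStreams` are names invented for this example. Unlike the portion of code shown in the question, it also destroys the streams explicitly with `cudaStreamDestroy` instead of leaving that to `cudaDeviceReset`.

```cuda
#include <cuda_runtime.h>

const int kMaxStreams = 5;  // pool size, as in the example above

// Rotate through the pool: work item -> stream slot.
inline int streamSlot(int item) { return item % kMaxStreams; }

// Placeholder kernel (stand-in for matrixMul): doubles each element.
__global__ void doubleAll(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Launches `items` independent kernels over a pool of kMaxStreams streams.
// Returns the number of CUDA calls that reported an error.
int runWithStreamPool(int items, int n) {
    int failures = 0;
    cudaStream_t pool[kMaxStreams];

    // Create the pool once; remember how many streams actually exist.
    int created = 0;
    while (created < kMaxStreams &&
           cudaStreamCreate(&pool[created]) == cudaSuccess)
        created++;
    if (created < kMaxStreams) failures++;

    // One device allocation, sliced into `items` chunks of n floats.
    float *d = nullptr;
    size_t bytes = (size_t)items * n * sizeof(float);
    if (!failures && cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        d = nullptr;
        failures++;
    }

    if (!failures) {
        if (cudaMemset(d, 0, bytes) != cudaSuccess) failures++;
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        for (int i = 0; i < items; i++)
            doubleAll<<<blocks, threads, 0, pool[streamSlot(i)]>>>(d + (size_t)i * n, n);
        if (cudaGetLastError() != cudaSuccess) failures++;       // launch errors
        if (cudaDeviceSynchronize() != cudaSuccess) failures++;  // execution errors
    }

    // Release everything that was acquired: one destroy per created stream.
    if (cudaFree(d) != cudaSuccess) failures++;
    for (int s = 0; s < created; s++)
        if (cudaStreamDestroy(pool[s]) != cudaSuccess) failures++;
    return failures;
}
```

Calling `runWithStreamPool(5000, 1024)` from `main` would issue 5000 launches while only ever holding 5 streams.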
    

Upvotes: 1
