Reputation: 748
I am trying to multiply matrix A with matrix B n times. I use a kernel for the matrix multiplication and streams to run that multiplication N times. I have 3 conditions to test consecutively. The first condition runs successfully.
I don't know why I get an "Invalid Argument" error in the iteration for the second condition. My guess is that I am not cleaning up my memory properly. I have done my best to free all host and device variables, and I also tried a CUDA device reset, but nothing helps. Can anyone help me debug this?
Please find the portion of my code here:
int main(){
    for (int i = 0; i < 3; i++) {
        for (int ind = 0; ind < itr; ind++){
            cudaStreamCreate(&(stream[ind]));
        }
        cudaCheckErrors("cudaStreamCreate fail");
        for (int ind = 0; ind < itr; ind++){
            cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind]);
        }
        cudaDeviceSynchronize();
        for (int ind = 0; ind < itr; ind++){
            // Launch our kernel
            matrixMul<<<BLOCKS, THREADS, 0, stream[ind]>>>(d_a[ind], b, d_c[ind], M, K, N);
        }
        cudaDeviceSynchronize();
        cudaCheckErrors("kernel fail");
        for (int ind = 0; ind < itr; ind++){
            cudaMemcpyAsync(h_c[ind], d_c[ind], bytes_c, cudaMemcpyDeviceToHost, stream[ind]);
        }
        for (int ind = 0; ind < itr; ind++){
            cudaStreamSynchronize(stream[ind]);
        }
        cudaEventRecord( stop, 0 );
        cudaEventSynchronize( stop );
        cudaEventDestroy( start );
        cudaEventDestroy( stop);
        // Free allocated memory ****The issue was here.******
        cudaFreeHost(h_a);
        cudaFree(b);
        cudaFreeHost(h_c);
        cudaFree(d_a);
        cudaFree(d_c);
        cudaDeviceReset();
    }
    return 0;
}
In the second iteration I was getting this error:
Fatal error: cudaStreamCreate fail (invalid argument at /tmp/tmpwgpzgk9m/73a7502c-7662-4e80-804e-4debff15dc45.cu:140)
*** FAILED - ABORTING
Solved:
The error was caused by a memory leak. I was allocating memory for every pointer in the arrays but freeing only the first one. As suggested in Robert's answer below, the memory has to be freed for each index of the array. Also, please always use proper error checking in CUDA.
Upvotes: 0
Views: 916
Reputation: 152143
Suggestion: Implement proper CUDA error checking. Use it on every CUDA call. Your haphazard use of the error checking macro produces confusing output that seems to suggest a problem with stream creation.
That is not the case. The invalid argument is arising from your freeing operations at the end of the loop. You have a number of errors:
1. You call cudaFreeHost on a pointer returned by malloc, or on a pointer that is actually a stack array.
2. You call cudaFree on a pointer that is actually a stack array.
3. Even though you call cudaDeviceReset (which frees all device allocations anyway), you have a memory leak because of improper freeing of the malloc allocations.
By modifying the end of your code as follows:
...
        cudaEventDestroy( start );
        cudaEventDestroy( stop);
        for (int ind = 0; ind < itr; ind++){
            free(h_a[ind]);
            free(h_c[ind]);
            cudaFree(d_a[ind]);
            cudaFree(d_c[ind]);
        }
        // Free allocated memory
        cudaFree(b);
        cudaDeviceReset();
    }
...
I was able to make the above errors disappear.
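To make the error-checking suggestion concrete, here is a minimal sketch of what checking every call could look like, combined with freeing each allocation through the function that matches how it was allocated. The macro name CHECK_CUDA, the count itr = 4 and the buffer size are assumptions for illustration only; they are not taken from the original code, which uses its own cudaCheckErrors macro.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// CHECK_CUDA is a hypothetical helper name, not part of the CUDA runtime.
// It wraps a single runtime call and reports the failing file and line.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int itr = 4;                          // assumed small count
    const size_t bytes = 1024 * sizeof(float);  // assumed buffer size

    float *h_a[itr];  // stack array of host pointers: never free the array itself
    float *d_a[itr];  // stack array of device pointers: never cudaFree the array itself

    for (int ind = 0; ind < itr; ind++) {
        h_a[ind] = (float *)malloc(bytes);                // pageable host memory
        if (h_a[ind] == NULL) {
            fprintf(stderr, "malloc failed\n");
            return EXIT_FAILURE;
        }
        memset(h_a[ind], 0, bytes);
        CHECK_CUDA(cudaMalloc((void **)&d_a[ind], bytes)); // device memory
    }

    for (int ind = 0; ind < itr; ind++) {
        CHECK_CUDA(cudaMemcpy(d_a[ind], h_a[ind], bytes, cudaMemcpyHostToDevice));
    }

    // Free each element with the function that matches its allocation:
    // malloc -> free, cudaMalloc -> cudaFree.
    for (int ind = 0; ind < itr; ind++) {
        free(h_a[ind]);
        CHECK_CUDA(cudaFree(d_a[ind]));
    }

    printf("done\n");
    return 0;
}
```

With a check on each call, a bad free is reported at the line where it happens instead of surfacing later at an unrelated call such as cudaStreamCreate.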
As an aside, it should not be necessary to create 5000 streams, but it appears to work so I'll leave it at that. I would normally advise stream reuse.
Stream reuse could look something like this. Instead of creating 5000 streams, pick a smaller number, like 5. The exact number shouldn't matter much here; it's likely that anything in the range of 3 or greater will behave similarly.
Create that many streams:
const int max_streams = 5;
for (int ind = 0; ind < max_streams; ind++){
    cudaStreamCreate(&(stream[ind]));
}
When it comes to using the streams, use modulo arithmetic to "rotate" through the streams:
for (int ind = 0; ind < itr; ind++){
    cudaMemcpyAsync(d_a[ind], h_a[ind], bytes_a, cudaMemcpyHostToDevice, stream[ind%max_streams]);
}
cudaDeviceSynchronize();
for (int ind = 0; ind < itr; ind++){
    // Launch our kernel
    matrixMul<<<BLOCKS, THREADS, 0, stream[ind%max_streams]>>>(d_a[ind], b, d_c[ind], M, K, N);
}
cudaDeviceSynchronize();
...
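For reference, the rotation idea can be put together into a small self-contained program. This is only a sketch under several assumptions: the kernel scaleByTwo is a stand-in for matrixMul (whose source is not shown in the question), CHECK_CUDA is a hypothetical error-checking macro, the sizes are arbitrary, and the host buffers are allocated with cudaMallocHost, which is why cudaFreeHost is the matching call to release them here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper, not part of the CUDA runtime.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel (assumption): doubles each input element.
__global__ void scaleByTwo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int itr = 16;          // number of independent work items (assumed)
    const int max_streams = 5;   // small, fixed pool of streams
    const int n = 1 << 10;       // elements per work item (assumed)
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream[max_streams];
    float *h_in[itr], *h_out[itr], *d_in[itr], *d_out[itr];

    for (int s = 0; s < max_streams; s++)
        CHECK_CUDA(cudaStreamCreate(&stream[s]));

    for (int ind = 0; ind < itr; ind++) {
        // Pinned host memory, so it must later be released with cudaFreeHost.
        CHECK_CUDA(cudaMallocHost((void **)&h_in[ind], bytes));
        CHECK_CUDA(cudaMallocHost((void **)&h_out[ind], bytes));
        CHECK_CUDA(cudaMalloc((void **)&d_in[ind], bytes));
        CHECK_CUDA(cudaMalloc((void **)&d_out[ind], bytes));
        for (int i = 0; i < n; i++) h_in[ind][i] = (float)ind;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int ind = 0; ind < itr; ind++) {
        // Rotate through the pool; work queued in one stream runs in order.
        cudaStream_t s = stream[ind % max_streams];
        CHECK_CUDA(cudaMemcpyAsync(d_in[ind], h_in[ind], bytes,
                                   cudaMemcpyHostToDevice, s));
        scaleByTwo<<<blocks, threads, 0, s>>>(d_in[ind], d_out[ind], n);
        CHECK_CUDA(cudaGetLastError());  // catches launch configuration errors
        CHECK_CUDA(cudaMemcpyAsync(h_out[ind], d_out[ind], bytes,
                                   cudaMemcpyDeviceToHost, s));
    }
    CHECK_CUDA(cudaDeviceSynchronize());

    // Count work items whose first element does not match the expected value.
    int mismatches = 0;
    for (int ind = 0; ind < itr; ind++)
        if (h_out[ind][0] != 2.0f * (float)ind) mismatches++;
    printf("mismatches: %d\n", mismatches);

    // Release every element with the call that matches its allocation.
    for (int ind = 0; ind < itr; ind++) {
        CHECK_CUDA(cudaFreeHost(h_in[ind]));
        CHECK_CUDA(cudaFreeHost(h_out[ind]));
        CHECK_CUDA(cudaFree(d_in[ind]));
        CHECK_CUDA(cudaFree(d_out[ind]));
    }
    for (int s = 0; s < max_streams; s++)
        CHECK_CUDA(cudaStreamDestroy(stream[s]));
    return mismatches != 0;
}
```

Because the copy in, the kernel and the copy out for a given index all go to the same stream, no device-wide synchronization is needed between the stages in this sketch. The streams are also destroyed at the end, which the loop in the question never does.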
Upvotes: 1