Reputation: 43
I tried to apply cublasDgemm (matrix-matrix multiplication) to several matrices of "double" (8-byte) elements, each of which has one very large dimension. In my case, the matrices are 12755046 by 46. Put simply: A[46,12755046] * B_i[12755046,46] = C_i[46,46], where i = 1, 2, 3, ...
The machine has 128 GB of memory and two RTX 2080 Ti GPUs (11 GB of GPU memory each), so my original strategy was to distribute the B_i across the GPUs. However, I always get an INTERNAL ERROR when I run my code on two GPUs.
I worked around the problem by trying three things, each of which runs without error: 1. Use one GPU only. 2. Reduce the matrix size but keep using two GPUs. 3. Use cublasXt, which uses both GPUs implicitly.
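For reference, workaround 3 might look roughly like the sketch below. It is not my actual code: the reduced inner dimension k, the all-ones fill values, and the choice to select every visible device are assumptions made to keep the example small. It relies only on the cublasXt calls cublasXtCreate, cublasXtDeviceSelect, cublasXtDgemm and cublasXtDestroy, which accept ordinary host pointers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublasXt.h>

int main() {
    // Assumption: a reduced inner dimension; the question uses k = 12755046.
    const size_t m = 46, n = 46, k = 100000;

    // cublasXt works on host-resident matrices (column-major) and
    // distributes the work across the selected GPUs itself.
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

    int nGPUs = 0;
    if (cudaGetDeviceCount(&nGPUs) != cudaSuccess || nGPUs < 1) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    std::vector<int> devices(nGPUs);
    for (int i = 0; i < nGPUs; ++i) devices[i] = i;

    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasXtCreate failed\n");
        return 1;
    }
    // Tell cublasXt which GPUs it may use.
    if (cublasXtDeviceSelect(handle, nGPUs, devices.data()) != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasXtDeviceSelect failed\n");
        cublasXtDestroy(handle);
        return 1;
    }

    const double alpha = 1.0, beta = 0.0;
    // C (m x n) = alpha * A (m x k) * B (k x n) + beta * C
    cublasStatus_t st = cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                      m, n, k,
                                      &alpha, A.data(), m,
                                      B.data(), k,
                                      &beta, C.data(), m);
    if (st != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasXtDgemm failed with status %d\n", (int)st);
    } else {
        // With all-ones inputs, each entry of C should equal k.
        std::printf("C[0] = %.1f (k = %zu)\n", C[0], k);
    }

    cublasXtDestroy(handle);
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}
```

Here the library decides how to split the work, so there is no per-device handle to manage.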
Although the problem is solved, I would still like to know why my original plan did not work for matrices with a large dimension. Is this due to some internal limitation of cuBLAS, or did I miss some configuration?
I have attached simplified code to illustrate my original plan:
const int nGPUs = 2;
double *A, *B[2], *C[2];

// Managed (unified memory) allocations
cudaMallocManaged(&A, 46*12755046*sizeof(double));
cudaMallocManaged(&B[0], 46*12755046*sizeof(double));
cudaMallocManaged(&B[1], 46*12755046*sizeof(double));
cudaMallocManaged(&C[0], 46*12755046*sizeof(double));
cudaMallocManaged(&C[1], 46*12755046*sizeof(double));

// Fill the input matrices
givevalueto(A);
givevalueto(B[0]);
givevalueto(B[1]);

double alpha = 1.0;
double beta = 0.0;
cublasHandle_t handle[nGPUs];
int iGPU;

// Create one cuBLAS handle per GPU
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cublasCreate(&handle[iGPU]);
}

// Launch one DGEMM per GPU: C_i = A * B_i
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cublasDgemm(handle[iGPU],CUBLAS_OP_N,CUBLAS_OP_N,46,46,12755046,&alpha,A,46,B[iGPU],12755046,&beta,C[iGPU],46);
}

// Wait for both GPUs to finish
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU);
    cudaDeviceSynchronize();
}

for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaFree(B[iGPU]);
}
Upvotes: 0
Views: 572
Reputation: 151829
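A cuBLAS handle is bound to the device that was current when the handle was created. Your creation loop never calls cudaSetDevice(), so both handles are created while the default device (device 0) is current, and handle[1] is later used after switching to device 1.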
From the documentation for cublasCreate:
The CUBLAS library context is tied to the current CUDA device.
See also the description of the cublas context:
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate().
You can fix your code with:
for(iGPU=0;iGPU<nGPUs;iGPU++)
{
    cudaSetDevice(iGPU); // add this line
    cublasCreate(&handle[iGPU]);
}
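Put together, a complete program using that pattern might look like the sketch below. It is a minimal illustration under assumptions, not your exact code: the inner dimension k is reduced, the fill values are arbitrary, C is allocated at its actual 46 x 46 size, and the device count is queried with cudaGetDeviceCount() instead of being hard-coded. Error handling is reduced to checking the cublasDgemm status.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    int nGPUs = 0;
    if (cudaGetDeviceCount(&nGPUs) != cudaSuccess || nGPUs < 1) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // Assumption: reduced inner dimension; the question uses k = 12755046.
    const int m = 46, n = 46, k = 100000;
    const double alpha = 1.0, beta = 0.0;

    // A is shared by all GPUs through managed memory, as in the question.
    double *A = nullptr;
    cudaMallocManaged(&A, sizeof(double) * m * k);
    for (size_t j = 0; j < (size_t)m * k; ++j) A[j] = 1.0;

    std::vector<double*> B(nGPUs, nullptr), C(nGPUs, nullptr);
    std::vector<cublasHandle_t> handle(nGPUs);

    for (int i = 0; i < nGPUs; ++i) {
        cudaMallocManaged(&B[i], sizeof(double) * k * n);
        cudaMallocManaged(&C[i], sizeof(double) * m * n);
        for (size_t j = 0; j < (size_t)k * n; ++j) B[i][j] = i + 1.0;

        cudaSetDevice(i);          // make device i current first ...
        cublasCreate(&handle[i]);  // ... so this handle is tied to device i
    }

    // Launch one DGEMM per GPU, each with the handle created on that GPU.
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(i);
        cublasStatus_t st = cublasDgemm(handle[i], CUBLAS_OP_N, CUBLAS_OP_N,
                                        m, n, k,
                                        &alpha, A, m,
                                        B[i], k,
                                        &beta, C[i], m);
        if (st != CUBLAS_STATUS_SUCCESS)
            std::fprintf(stderr, "GPU %d: cublasDgemm status %d\n", i, (int)st);
    }

    // Wait for every GPU before touching managed memory on the host.
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }

    for (int i = 0; i < nGPUs; ++i)
        std::printf("GPU %d: C[0] = %.1f\n", i, C[i][0]);

    // Destroy each handle with its own device current, then free memory.
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(i);
        cublasDestroy(handle[i]);
        cudaFree(B[i]);
        cudaFree(C[i]);
    }
    cudaFree(A);
    return 0;
}
```

The key point is only the ordering inside the first loop: cudaSetDevice() comes before cublasCreate(), and the same device is made current again before each handle is used or destroyed.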
Upvotes: 2