Reputation: 3885
I have a sample loop of the following form. Notice that my psi[i][j] depends on psi[i+1][j], psi[i-1][j], psi[i][j+1] and psi[i][j-1], and I have to calculate psi for the inner matrix only. I tried writing this in CUDA, but the results are not the same as the sequential version.
for(i=1; i<=leni-2; i++)
    for(j=1; j<=lenj-2; j++){
        psi[i][j] = ( omega[i][j]*(dx*dx)*(dy*dy) +
                      (psi[i+1][j]+psi[i-1][j])*(dy*dy) +
                      (psi[i][j+1]+psi[i][j-1])*(dx*dx)
                    ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
Here's my CUDA version.
//KERNEL
__global__ void ComputePsi(double *psi, double *omega, int imax, int jmax)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (jmax*x) + y;
    double beta = 1;
    double dx = (double)30/(imax-1);
    double dy = (double)1/(jmax-1);
    if(i%jmax != 0 && (i+1)%jmax != 0 && i >= jmax && i < imax*jmax-jmax){
        psi[i] = ( omega[i]*(dx*dx)*(dy*dy) +
                   (psi[i+jmax]+psi[i-jmax])*(dy*dy) +
                   (psi[i+1]+psi[i-1])*(dx*dx)
                 ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
}
//Code
cudaMalloc((void **) &dev_psi, leni*lenj*sizeof(double));
cudaMalloc((void **) &dev_omega, leni*lenj*sizeof(double));
cudaMemcpy(dev_psi, psi, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy(dev_omega, omega, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
dim3 grids(leni,lenj);
for(iterpsi = 0; iterpsi < 30; iterpsi++)
    ComputePsi<<<grids, 1>>>(dev_psi, dev_omega, leni, lenj);
Here psi[leni][lenj] and omega[leni][lenj] are double arrays. The problem is that the sequential and CUDA codes give different results. Is there any modification needed in the code?
Upvotes: 0
Views: 215
Reputation: 4926
for(i=1; i<=leni-2; i++)
    for(j=1; j<=lenj-2; j++){
        psi[i][j] = ( omega[i][j]*(dx*dx)*(dy*dy) +
                      (psi[i+1][j]+psi[i-1][j])*(dy*dy) +
                      (psi[i][j+1]+psi[i][j-1])*(dx*dx)
                    ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
I think this loop is not well defined sequentially either: the value of psi[i][j] depends on the order of the operations. Within a sweep you read psi[i+1][j] and psi[i][j+1] before they are updated, while psi[i-1][j] and psi[i][j-1] have already been updated in the same sweep (a Gauss-Seidel-style update). With CUDA the result is bound to differ, because the order of the operations is different. To enforce that ordering, if it is possible at all, you would need so many synchronizations that it is probably not worthwhile in CUDA. Is it really what you need to do?
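If a Jacobi-style update (one that reads only values from the previous sweep) is acceptable for your problem, the sequential reference itself becomes order-independent. A minimal sketch, assuming a second leni x lenj array psi_new (that name is mine, not from your code):

for(i=1; i<=leni-2; i++)
    for(j=1; j<=lenj-2; j++){
        /* every read comes from the old psi, so the traversal order no longer matters */
        psi_new[i][j] = ( omega[i][j]*(dx*dx)*(dy*dy) +
                          (psi[i+1][j]+psi[i-1][j])*(dy*dy) +
                          (psi[i][j+1]+psi[i][j-1])*(dx*dx)
                        ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
/* copy psi_new back into psi (or swap pointers) before the next sweep */

Such a version parallelizes directly, and both the Gauss-Seidel and the Jacobi sweeps should converge to the same solution of the discretized equation, even though the intermediate iterates differ.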
Upvotes: 0
Reputation: 949
You are working in global memory and changing psi entries while other threads might still need the old values. Store the values of the new iteration in a separate array, and keep in mind that you have to swap the two arrays after each iteration. A more sophisticated approach would be a solution using shared memory, with a spatial subdomain assigned to each thread block. Search for CUDA tutorials on solving the heat/diffusion equation and you will get the idea. A sketch of the double-buffered approach follows.
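Here is a minimal sketch of that double-buffered (Jacobi) idea applied to your kernel. The names ComputePsiJacobi and dev_psi_new are assumptions, and both buffers must be filled with the initial psi (including the boundary values) before the first launch, because the kernel only writes interior points:

__global__ void ComputePsiJacobi(const double *psi_old, double *psi_new,
                                 const double *omega, int imax, int jmax)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (jmax*x) + y;
    double dx = (double)30/(imax-1);
    double dy = (double)1/(jmax-1);
    // interior points only: skip the first/last row and column
    if(i%jmax != 0 && (i+1)%jmax != 0 && i >= jmax && i < imax*jmax-jmax){
        psi_new[i] = ( omega[i]*(dx*dx)*(dy*dy) +
                       (psi_old[i+jmax]+psi_old[i-jmax])*(dy*dy) +
                       (psi_old[i+1]+psi_old[i-1])*(dx*dx)
                     ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
}

// host side: read from one buffer, write to the other, then swap
double *dev_old = dev_psi, *dev_new = dev_psi_new;  // dev_psi_new: a second cudaMalloc'd buffer, initialized like dev_psi
for(iterpsi = 0; iterpsi < 30; iterpsi++){
    ComputePsiJacobi<<<grids, 1>>>(dev_old, dev_new, dev_omega, leni, lenj);
    double *tmp = dev_old; dev_old = dev_new; dev_new = tmp;  // swap after each sweep
}

After the loop, dev_old points to the most recent values (with 30 sweeps that happens to be dev_psi again). One thread per block (<<<grids, 1>>>) is kept only to match your launch configuration; for performance you would normally use a 2-D block of threads and, as said above, shared memory for the stencil reads.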
Upvotes: 1