How to deal with large 3D data arrays for better performance?

Question

I am working with large data sets stored in 3D arrays. Here is an example of one of my kernels (the CPU calls it in a for loop):

attributes(global) subroutine mykernel(A, B, C, p, nx, ny, nz)

real, dimension(:,:,:), device :: A, B
real, dimension(:), device :: C
real, device :: p
integer, device :: nx, ny, nz
integer :: xInd, yInd, zInd   ! per-thread indices (must be integer, not implicitly real)

! 1-based global indices of this thread; CUDA Fortran uses % for components
xInd = blockDim%x * (blockIdx%x-1) + threadIdx%x
yInd = blockDim%y * (blockIdx%y-1) + threadIdx%y
zInd = blockDim%z * (blockIdx%z-1) + threadIdx%z

! Skip threads that fall outside the nx x ny x nz domain
if (xInd <= nx .and. yInd <= ny .and. zInd <= nz) then
   A(xInd,yInd,zInd) = (A(xInd,yInd+1,zInd) - A(xInd,yInd,zInd))*p &
                     - (B(xInd,yInd,zInd+1) - C(yInd) + B(xInd+1,yInd,zInd))*p &
                     + C(yInd+1)
end if

end subroutine mykernel

Everything seems fine when I launch the kernel: the GPU results match the CPU results. But the performance is not great; the kernel takes too long to run.

I think this is due to memory access, but I'm not sure. I would have put my 3D arrays in shared memory, but I'm dealing with nx*ny*nz > 1M elements, so they do not fit in shared memory.

So my questions are about performance with a large data set:

Should I flatten my 3D arrays into 1D arrays? Would that give a speedup?
Is it possible to read large arrays without using global or shared memory?
What else could be causing performance problems in this case?
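
To make the flattening question concrete, here is a minimal host-only sketch in plain Fortran (no CUDA) of what a flattened 1D index would look like. The program name, the helper flat_index, and the small sizes are hypothetical and only for illustration. It assumes the array is exactly nx by ny by nz with 1-based indices. Fortran stores arrays in column-major order, so the leftmost index is the contiguous one, and the 1D array produced by reshape holds the same elements in the same memory order.

```fortran
program flatten_demo
  implicit none
  integer, parameter :: nx = 4, ny = 3, nz = 2   ! hypothetical small sizes
  real :: A3(nx, ny, nz)      ! 3D view
  real :: A1(nx*ny*nz)        ! flattened 1D view
  integer :: x, y, z, mismatches

  ! Fill the 3D array with a distinct value per element.
  do z = 1, nz
    do y = 1, ny
      do x = 1, nx
        A3(x, y, z) = real(100*x + 10*y + z)
      end do
    end do
  end do

  ! reshape copies elements in array-element (column-major) order.
  A1 = reshape(A3, [nx*ny*nz])

  ! Check that the hand-written linear index addresses the same element.
  mismatches = 0
  do z = 1, nz
    do y = 1, ny
      do x = 1, nx
        if (A1(flat_index(x, y, z)) /= A3(x, y, z)) mismatches = mismatches + 1
      end do
    end do
  end do

  print *, 'mismatches:', mismatches

contains

  ! 1-based column-major linear index: x varies fastest, z slowest.
  pure integer function flat_index(x, y, z)
    integer, intent(in) :: x, y, z
    flat_index = x + (y - 1)*nx + (z - 1)*nx*ny
  end function flat_index

end program flatten_demo
```

The check should report zero mismatches. Under these assumptions, flattening mainly changes who writes the index arithmetic, not where the data sits in memory. In the kernel above, threads that are adjacent in x already touch adjacent elements of A, because xInd is the leftmost index.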
