I am working with large data sets stored in 3D arrays. Here is an example of one of my kernels (the CPU calls it inside a loop):
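attributes(global) subroutine mykernel(A, B, C, p, nx, ny, nz)
  implicit none
  real, dimension(:,:,:), device :: A, B
  real, dimension(:), device :: C
  ! Scalars are declared as device variables here;
  ! passing them with the value attribute is the more common choice.
  real, device :: p
  integer, device :: nx, ny, nz
  ! Declared explicitly: under implicit typing x*, y*, z* would be real
  integer :: xInd, yInd, zInd

  ! Global (1-based) index of this thread in each dimension
  xInd = blockDim%x * (blockIdx%x - 1) + threadIdx%x
  yInd = blockDim%y * (blockIdx%y - 1) + threadIdx%y
  zInd = blockDim%z * (blockIdx%z - 1) + threadIdx%z

  ! Skip the threads that fall outside the nx x ny x nz range
  if (xInd <= nx .and. yInd <= ny .and. zInd <= nz) then
    A(xInd,yInd,zInd) = (A(xInd,yInd+1,zInd) - A(xInd,yInd,zInd)) * p &
                      - (B(xInd,yInd,zInd+1) - C(yInd) + B(xInd+1,yInd,zInd)) * p &
                      + C(yInd+1)
  end if
end subroutine mykernel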
Everything seems fine when I launch the kernel: the GPU results match the CPU results. But the execution time is not really great.
I think this is due to memory access, but I'm not sure. I would have put my 3D arrays in shared memory, but I'm dealing with nx*ny*nz > 1M elements, so there isn't enough space in shared memory.
So my questions are about performance issues with a large data set:
Upvotes: 1
Views: 268
Okay, so I think I have figured out what my issues were, at least in my case.
First, the execution configuration of my kernels. Launching over a 3D grid turned out not to be a good idea, because I use too many threads. For example, here I chose to work with blocks of 512 threads (8x8x8). With a 3D grid I call mykernel with 512*(348/8+1)*(145/8+1)*(113/8+1) = 6,420,480 threads (using integer division, since block counts are whole numbers). If I flatten my 3D array to 1D, I only use 512*((348*145*113)/512+1) = 5,702,144 threads. But why does using more threads hurt my performance here?
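The launch sizes can be checked with a few lines of plain Fortran. This is only a sketch of the arithmetic, assuming 8x8x8 blocks for the 3D launch and 512-thread blocks for the flattened one; the variable names are made up for illustration. Block counts are whole numbers, so each dimension is rounded up, which gives slightly different totals than a real-valued division would.

```fortran
program thread_counts
  implicit none
  ! Array extents from the example above
  integer, parameter :: nx = 348, ny = 145, nz = 113
  ! Assumed block shape: 8*8*8 = 512 threads per block
  integer, parameter :: bx = 8, by = 8, bz = 8
  integer, parameter :: block_size = bx * by * bz
  integer :: gx, gy, gz, g1d, n_elem

  n_elem = nx * ny * nz                 ! 5,701,980 elements

  ! Round up in each dimension so every element is covered
  gx = (nx + bx - 1) / bx               ! 44 blocks
  gy = (ny + by - 1) / by               ! 19 blocks
  gz = (nz + bz - 1) / bz               ! 15 blocks

  ! Flattened launch: round up once, over the total element count
  g1d = (n_elem + block_size - 1) / block_size   ! 11137 blocks

  print *, 'elements          :', n_elem
  print *, '3D launch threads :', gx * gy * gz * block_size   ! 6,420,480
  print *, '1D launch threads :', g1d * block_size            ! 5,702,144
  print *, '3D idle threads   :', gx * gy * gz * block_size - n_elem
  print *, '1D idle threads   :', g1d * block_size - n_elem
end program thread_counts
```

With these sizes the 3D launch leaves 718,500 threads with nothing to do, against 164 for the flattened launch, because the rounding waste is paid in every dimension instead of once.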
Moreover, in the CPU loop (where I call mykernel), I was doing too many transfers between the CPU and the GPU. To reduce the time spent in these transfers I used pinned memory, which is quite efficient. I strongly recommend this link for more explanations on how to optimize data transfers.
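For illustration, a minimal sketch of pinned host memory in CUDA Fortran might look like the following. It is not the code from this project: the names a_host, a_dev and is_pinned are hypothetical, and it assumes a CUDA Fortran compiler (the cudafor module).

```fortran
program pinned_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 348 * 145 * 113
  ! The pinned attribute asks for page-locked host memory
  real, allocatable, pinned :: a_host(:)
  real, allocatable, device :: a_dev(:)
  integer :: istat
  logical :: is_pinned

  ! pinned= reports whether page-locked memory was actually obtained;
  ! if not, the array falls back to ordinary pageable memory
  allocate(a_host(n), stat=istat, pinned=is_pinned)
  if (istat /= 0) stop 'host allocation failed'
  if (.not. is_pinned) print *, 'warning: a_host is not pinned'

  allocate(a_dev(n))

  a_host = 1.0
  a_dev = a_host      ! host-to-device copy
  ! ... kernel launches would go here ...
  a_host = a_dev      ! device-to-host copy

  deallocate(a_dev)
  deallocate(a_host)
end program pinned_sketch
```

Pinning helps the copies themselves. Keeping the data on the device between loop iterations, where the algorithm allows it, avoids the copies entirely.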
With all these changes, my GPU code runs 16 times faster than the CPU code, which is quite good! The first version of my code was "only" 7 times faster.
Hope this helps.
Upvotes: 1