Reputation: 61
I am doing huge calculations (derivatives here, but they look similar to image operations) on arrays that do not fit in cache, meaning the CPU has to load a part into the cache, compute, then load another part, and so on. Because of the shape of the calculations, some data is loaded, unloaded, and reloaded multiple times. I was wondering if there is a way to optimize this. I am already using SIMD instructions through compiler optimization (GCC and Intel).
This is Fortran code, but it is similar to C/C++: the memory order is just inverted, arrays use () instead of [], and for is replaced by do.
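To make that inverted memory order concrete for C/C++ readers, here is a minimal self-contained illustration (not part of my production code): Fortran is column-major, so the first index varies fastest in memory and inner loops should run over it.
program layout
  implicit none
  integer, parameter :: n = 4, m = 3
  real :: A(n,m)
  integer :: i, j
  do j = 1, m
    do i = 1, n          ! first index i is the contiguous one
      A(i,j) = real(i + n*(j-1))
    end do
  end do
  print *, A             ! prints 1..12: memory order runs column by column
end program layout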
On the x axis:
do k = 1, N(3)
  do j = 1, N(2)
    do i = 3, N(1)-2   ! stencil uses i-2..i+2, so stay 2 points from the edges
      DF(i,j,k) = (F(i+1,j,k)-F(i-1,j,k))*B + (F(i-2,j,k)-F(i+2,j,k))*C
    end do
  end do
end do
On the y axis:
do k = 1, N(3)
  do j = 3, N(2)-2   ! stencil uses j-2..j+2
    do i = 1, N(1)
      DF(i,j,k) = (F(i,j+1,k)-F(i,j-1,k))*B + (F(i,j-2,k)-F(i,j+2,k))*C
    end do
  end do
end do
On the z axis:
do k = 3, N(3)-2   ! stencil uses k-2..k+2
  do j = 1, N(2)
    do i = 1, N(1)
      DF(i,j,k) = (F(i,j,k+1)-F(i,j,k-1))*B + (F(i,j,k-2)-F(i,j,k+2))*C
    end do
  end do
end do
The first derivative along the x axis is OK because memory is read contiguously. For the y and z derivatives, the stencil points are far apart in memory, so the accesses are not contiguous.
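(Since Fortran stores arrays column-major, F(i,j+1,k) lies N(1) elements away from F(i,j,k) in memory, and F(i,j,k+1) lies N(1)*N(2) elements away; the y stencil therefore needs five j-columns, and the z stencil five k-planes, resident in cache at once.)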
And the worst calculation I have combines all axes (this is a Laplacian operator):
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1   ! stencil uses i-1..i+1 (and likewise for j and k)
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
Note that compilers do not vectorize the last operation (the Laplacian) as written. To get SIMD (vectorized) code, I have to split the operation like this, which gives a 2.5x speedup:
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k)
    end do
  end do
end do
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = V(i,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k)
    end do
  end do
end do
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = V(i,j,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
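One alternative I could also try (a sketch, not something benchmarked above) is to keep the Laplacian in a single loop and mark the inner loop with the OpenMP SIMD directive, which explicitly asks the compiler to vectorize it. Compile with gfortran -fopenmp-simd or ifort -qopenmp-simd; it assumes V, M, and p do not alias:
do k = 2, N(3)-1
  do j = 2, N(2)-1
    !$omp simd
    do i = 2, N(1)-1   ! request vectorization of the unsplit stencil
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
If it vectorizes, the single loop also writes V once instead of three times, saving memory traffic.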
Maybe with SIMD I have already reached maximum speed, but since these calculations take days, even with MPI and more than 1024 CPUs, reducing the computation time by even 20% would be a great step! Does anyone have ideas on how to optimize this?
Upvotes: 3
Views: 142
Reputation: 60018
When you use 3D stencils and you reference elements like i,j,k-1 and i,j,k+1, the linear order in which you go through the array will not be optimal. The cache efficiency can be increased by loop tiling.
In my code I use
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, Unz, tnz
  do bj = 1, Uny, tny
    do bi = 1, Unx, tnx
      do k = bk, min(bk+tnz-1,Unz)
        do j = bj, min(bj+tny-1,Uny)
          do i = bi, min(bi+tnx-1,Unx)
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i+1,j,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i-1,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j+1,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j-1,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j,k+1)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j,k-1)) * ...
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel
where tnx, tny, tnz are the sizes of a tile within which you iterate in the i,j,k order. The tile size should be chosen so that the tile's working set is close to the L1 cache size. This will increase reuse of the content loaded into the cache.
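As a rough worked example (my numbers, not from the answer): with 8-byte reals and hypothetical tile sizes tnx = 32, tny = 8, tnz = 8, one tile of U holds 32*8*8*8 B = 16 KiB, so U plus U2 roughly fill a typical 32 KiB L1 data cache; in practice you would leave room for the stencil halo and tune the tile sizes by benchmarking.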
If you need to separate the directions, you can of course do that and still keep the tiling.
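For instance, here is a minimal sketch (mine, not from the answer) that applies the same tiling to the y-derivative from the question, keeping i innermost so the inner loop stays contiguous and SIMD-friendly:
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, N(3), tnz
  do bj = 3, N(2)-2, tny
    do bi = 1, N(1), tnx
      do k = bk, min(bk+tnz-1, N(3))
        do j = bj, min(bj+tny-1, N(2)-2)
          do i = bi, min(bi+tnx-1, N(1))   ! contiguous inner loop
            DF(i,j,k) = (F(i,j+1,k)-F(i,j-1,k))*B + (F(i,j-2,k)-F(i,j+2,k))*C
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel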
Upvotes: 3