Reputation: 317
I am trying to port code from Fortran to C++, implementing the matrices with the Armadillo library. Both versions produce the same results, but the C++ code is much slower than the Fortran code (more than 10x). The code involves inverting, multiplying, and adding small matrices (2x2, 4x4). Here is a reduced piece of similar code for testing.
============================
clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2
ifort fort.f90 -o fort -O2
C++ code time: 0.39404s
Fortran code time: 0.068s
============================
C++ code:
#include <armadillo>
#include <iostream>
int main()
{
    const int niter = 1580000;
    const int ns = 3;
    arma::cx_cube m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns);
    arma::wall_clock timer;
    timer.tic();
    for (auto i=0; i<niter; ++i) {
        for (auto j=0; j<ns; ++j)
            m1.slice(j) += m2.slice(j) * m3.slice(j);
    }
    double n = timer.toc();
    std::cout << "time: " << n << "s" << std::endl;
    return 0;
}
Fortran code:
program main
    implicit none
    integer, parameter :: ns = 3, niter = 1580000
    complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
    integer i, j
    real :: start, finish
    call cpu_time(start)
    do i = 1, niter
        do j = 1, ns
            m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
            m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
            m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
            m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
        end do
    end do
    call cpu_time(finish)
    print *, "time: ", finish-start, " s"
end program main
====================================================================
Following the advice of @ewcz and @user5713492:
============================
clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2
ifort fort.f90 -o fort -O2
ifort fort2.f90 -o fort2 -O2
C++ code(cplusplus.cc) time: 0.39650s
Fortran code(fort.f90) (explicit operations) time: 0.020s
Fortran code(fort2.f90) (matmul) time: 0.064s
============================
C++ code(cplusplus.cc):
#include <armadillo>
#include <iostream>
#include <complex>
int main()
{
    const int niter = 1580000;
    const int ns = 3;
    arma::cx_cube m1(2, 2, ns, arma::fill::ones),
                  m2(2, 2, ns, arma::fill::ones),
                  m3(2, 2, ns, arma::fill::ones);
    std::complex<double> result;
    arma::wall_clock timer;
    timer.tic();
    for (auto i=0; i<niter; ++i) {
        for (auto j=0; j<ns; ++j)
            m1.slice(j) += m2.slice(j) * m3.slice(j);
    }
    double n = timer.toc();
    std::cout << "time: " << n << "s" << std::endl;
    result = arma::accu(m1);
    std::cout << result << std::endl;
    return 0;
}
Fortran code(fort.f90):
program main
    implicit none
    integer, parameter :: ns = 3, niter = 1580000
    complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
    integer i, j
    complex*16 result
    real :: start, finish
    m1 = 1
    m2 = 1
    m3 = 1
    call cpu_time(start)
    do i = 1, niter
        do j = 1, ns
            m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
            m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
            m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
            m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
        end do
    end do
    call cpu_time(finish)
    result = sum(m1)
    print *, "time: ", finish-start, " s"
    print *, result
end program main
Fortran code(fort2.f90):
program main
    implicit none
    integer, parameter :: ns = 3, niter = 1580000
    complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
    integer i, j
    complex*16 result
    real :: start, finish
    m1 = 1
    m2 = 1
    m3 = 1
    call cpu_time(start)
    do i = 1, niter
        do j = 1, ns
            m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))
        end do
    end do
    call cpu_time(finish)
    result = sum(m1)
    print *, "time: ", finish-start, " s"
    print *, result
end program main
======================================================================
Complex numbers may be one of the reasons that Armadillo is so slow. If I use arma::cube instead of arma::cx_cube in C++ and real*8 in Fortran, the timings are:
C++ code time: 0.08s
Fortran code(fort.f90) (explicit operations) time: 0.012s
Fortran code(fort2.f90) (matmul) time: 0.028s
However, complex numbers are necessary for my computation. It's strange that switching from real to complex increases the run time so much for Armadillo (from 0.08s to about 0.40s) but only a little for Fortran.
Upvotes: 3
Views: 854
Reputation: 13087
I would say that, in this particular example, your Fortran version profits significantly from expanding the matrix multiplication explicitly into elementary operations. To demonstrate this, consider the following modification:
program main
    implicit none
    integer, parameter :: ns = 3, niter = 1580000
    complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
    integer i, j
    real :: start, finish
    call cpu_time(start)
    m2 = 1
    m3 = 1
    do i = 1, niter
        do j = 1, ns
            !m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
            !m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
            !m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
            !m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
            m1(:, :, j) = m1(:, :, j) + MATMUL(m2(:, :, j), m3(:, :, j))
        end do
    end do
    WRITE(*, *) SUM(m1)
    call cpu_time(finish)
    print *, "time: ", finish-start, " s"
end program main
Here, the program prints the sum of m1 at the end to make at least partially sure that the entire loop is not eliminated. With the explicit multiplication (and -O2), I get a running time of roughly 0.05s, while with the general MATMUL it's roughly 0.2s, i.e., similar to the Armadillo approach...
Also, even though Armadillo is heavily template-based, so many of the function calls involved in creating the subcube views via slice() might get eliminated, you still in principle have some overhead, whereas in Fortran you are directly manipulating contiguous chunks of memory.
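To make the "explicit expansion" point concrete on the C++ side, here is a minimal sketch that bypasses Armadillo and unrolls the 2x2 complex product by hand, mirroring the four Fortran assignment lines. The names cd, Mat2, mul_add_2x2 and run_unrolled are made up for illustration and are not Armadillo API. I have not benchmarked this, so treat it as something to try rather than a measured result.

```cpp
#include <array>
#include <complex>
#include <cstddef>

using cd = std::complex<double>;

// One 2x2 complex matrix stored column-major (as in Fortran and Armadillo):
// element (r, c), zero-based, lives at index r + 2*c.
using Mat2 = std::array<cd, 4>;

// Hypothetical helper: c += a * b with the 2x2 product written out by hand.
inline void mul_add_2x2(Mat2& c, const Mat2& a, const Mat2& b) {
    c[0] += a[0] * b[0] + a[2] * b[1];  // element (1,1) in Fortran indexing
    c[2] += a[0] * b[2] + a[2] * b[3];  // element (1,2)
    c[1] += a[1] * b[0] + a[3] * b[1];  // element (2,1)
    c[3] += a[1] * b[2] + a[3] * b[3];  // element (2,2)
}

// Same loop nest as in the question: ns = 3 matrices, all filled with ones.
// Returns the sum of all elements of m1 so the caller can print it and the
// compiler cannot discard the loop.
cd run_unrolled(int niter) {
    constexpr std::size_t ns = 3;
    std::array<Mat2, ns> m1, m2, m3;
    for (std::size_t j = 0; j < ns; ++j) {
        m1[j].fill(cd(1.0, 0.0));
        m2[j].fill(cd(1.0, 0.0));
        m3[j].fill(cd(1.0, 0.0));
    }
    for (int i = 0; i < niter; ++i)
        for (std::size_t j = 0; j < ns; ++j)
            mul_add_2x2(m1[j], m2[j], m3[j]);

    cd total(0.0, 0.0);
    for (const auto& m : m1)
        for (const auto& x : m)
            total += x;
    return total;
}
```

Calling run_unrolled(1580000) from main between two timer reads and printing the returned sum gives a figure that can be compared with the slice() version and with fort.f90.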
Upvotes: 2
Reputation: 974
You aren't timing anything in gfortran. At -O2 it can see that you never use the value of m1, so it skips the calculation entirely. Also, your Fortran arrays are uninitialized, so you could be doing calculations with NaNs, which might slow things down considerably.
So you should initialize your arrays and use some kind of input like the command line, user input, or file contents so the compiler can't precompute the results.
Then you might consider changing the loop contents in Fortran to
m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))
so as to be consistent with the C++ version. (gfortran seemed to slow down a lot when doing this, but ifort was quite happy with it.)
Then you MUST print out your arrays at the end so the compiler doesn't conclude that the loop you are timing can be skipped as gfortran did. Edit in the fixes and let us know about the new results.
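The same two precautions apply to the C++ side. As a rough sketch, assuming a plain std::complex layout instead of Armadillo: take the fill value from runtime input so nothing can be precomputed, time only the loop, and hand back a checksum that gets printed. Result, timed_loop, parse_fill and report are hypothetical names chosen for this example.

```cpp
#include <chrono>
#include <complex>
#include <cstddef>
#include <cstdlib>
#include <ostream>
#include <vector>

using cd = std::complex<double>;

struct Result {
    double seconds;  // wall-clock time spent in the loop only
    cd checksum;     // sum of all elements of m1 after the loop
};

// Fill value taken from the command line (defaults to 1.0), so the compiler
// cannot know the inputs at compile time.
double parse_fill(int argc, char** argv) {
    return argc > 1 ? std::strtod(argv[1], nullptr) : 1.0;
}

// Times m1(:,:,j) += m2(:,:,j) * m3(:,:,j) for ns = 3 column-major 2x2
// matrices, using a generic triple loop rather than an unrolled product.
Result timed_loop(double fill, int niter) {
    const std::size_t ns = 3;
    std::vector<cd> m1(4 * ns, cd(fill, 0.0));
    std::vector<cd> m2(4 * ns, cd(fill, 0.0));
    std::vector<cd> m3(4 * ns, cd(fill, 0.0));

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < niter; ++i) {
        for (std::size_t j = 0; j < ns; ++j) {
            cd* c = &m1[4 * j];
            const cd* a = &m2[4 * j];
            const cd* b = &m3[4 * j];
            for (std::size_t col = 0; col < 2; ++col) {
                for (std::size_t row = 0; row < 2; ++row) {
                    cd acc(0.0, 0.0);
                    for (std::size_t k = 0; k < 2; ++k)
                        acc += a[row + 2 * k] * b[k + 2 * col];
                    c[row + 2 * col] += acc;
                }
            }
        }
    }
    const auto t1 = std::chrono::steady_clock::now();

    cd checksum(0.0, 0.0);
    for (const cd& x : m1) checksum += x;
    return {std::chrono::duration<double>(t1 - t0).count(), checksum};
}

// Printing the checksum is what makes the loop result observable.
void report(const Result& r, std::ostream& out) {
    out << "time: " << r.seconds << "s\n" << r.checksum << "\n";
}
```

In main this would be used as report(timed_loop(parse_fill(argc, argv), 1580000), std::cout), with <iostream> included for std::cout.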
Upvotes: 3