Reputation: 165
My system:
System specification: Intel Core 2 Duo E4500, 3700g memory, 2 MB L2 cache, x64, Fedora 17
How I measure flops/mflops
I use the PAPI library (which reads the hardware performance counters) to measure the flops and MFLOPS of my code. It returns the real time, the processing time, the total flops, and finally flops / processing time, which gives MFLOPS. The library uses hardware counters to count floating point instructions or floating point operations, together with total cycles, to produce the final flops and MFLOPS figures.
My computational kernels
I used a three-loop matrix-matrix multiplication (square matrices) and a three-level nested loop that performs some operations on 1D arrays in its innermost loop.
First Kernel MM
float a[size][size];
float b[size][size];
float c[size][size];
start_calculate_MFlops();
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        for (int k = 0; k < size; k += 1) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
stop_calculate_MFlops();
Second kernel with 1d array
float d[size];
float e[size];
float f[size];
float g[size];
float r = 3.6541;
start_calculate_MFlops();
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        for (int k = 0; k < size; ++k) {
            d[k] = d[k] + e[k] + f[k] + g[k] + r;
        }
    }
}
stop_calculate_MFlops();
What I know about flops
Matrix-matrix multiplication (MM) performs 2 floating point operations in its innermost loop, and since there are 3 loops that each iterate size times, in theory the total is 2*n^3 flops for MM.
The second kernel also has 3 loops, and its innermost loop performs a computation on 1D arrays with 4 floating point operations, so in theory the total is 4*n^3 flops.
I know that the flop counts calculated above are not exactly what happens on a real machine. On a real machine there are other operations, such as loads and stores, which will add to our theoretical flops.
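As a quick sanity check on the two theoretical counts above, here is a minimal sketch. The function names mm_flops and array_kernel_flops are only illustrative; they are not part of PAPI or of my measurement code.

```cpp
#include <cstdint>

// Theoretical flop count for the naive matrix multiplication kernel:
// one multiply and one add per innermost iteration, n^3 iterations.
constexpr std::uint64_t mm_flops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Theoretical flop count for the 1D-array kernel:
// four additions per innermost iteration, n^3 iterations.
constexpr std::uint64_t array_kernel_flops(std::uint64_t n) {
    return 4 * n * n * n;
}
```

For n = 512 these evaluate to 268,435,456 and 536,870,912 respectively.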
Question:
When I use 1D arrays, as in the second kernel, the theoretical flop count matches (or is very close to) the count I measure by running the code: the measured flops equal the number of operations in the innermost loop multiplied by n^3. But when I use my first kernel (MM), which uses 2D arrays, the theoretical count is 2n^3 while the measured value is much higher: about (4 + 2 operations in the innermost loop of the matrix multiplication) * n^3 = 6n^3. I then replaced the matrix multiplication line in the innermost loop with just the line below:
A[i][j]++;
The theoretical flop count for this line inside the three nested loops is 1 operation * n^3 = n^3. Again, when I ran the code, the result was much higher than expected: (2 + 1 operation in the innermost loop) * n^3 = 3n^3.
Sample results for matrices of size 512x512:
Real_time: 1.718368 Proc_time: 1.227672 Total flpops: 807,107,072 MFLOPS: 657.429016
Real_time: 3.608078 Proc_time: 3.042272 Total flpops: 807,024,448 MFLOPS: 265.270355
Theoretical flops: 2*512*512*512 = 268,435,456
Measured flops: 807,107,072, which is roughly 6*512^3 (= 805,306,368)
Sample result for the 1D array operation in 3 nested loops:
Real_time: 1.282257 Proc_time: 1.155990 Total flpops: 536,872,000 MFLOPS: 464.426117
Theoretical flops: 4n^3 = 4*512^3 = 536,870,912
Measured flops: 4n^3 + overhead (other operations?) = 536,872,000
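To make the comparison concrete, the measured totals can be divided by the n^3 inner-loop iterations. This is only a sketch of that arithmetic; flops_per_iteration is a name I made up for illustration and has nothing to do with the PAPI API.

```cpp
#include <cstdint>

// Average number of counted floating point operations per innermost-loop
// iteration, given the counter total and the problem size n.
double flops_per_iteration(std::uint64_t measured, std::uint64_t n) {
    const double iterations = static_cast<double>(n) * n * n;
    return static_cast<double>(measured) / iterations;
}
```

With the numbers above, flops_per_iteration(807107072, 512) comes out at roughly 6.01 for MM (expected 2), while flops_per_iteration(536872000, 512) is only marginally above 4 for the 1D kernel (expected 4).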
I could not find any reason for this behaviour. Is my assumption correct?
I hope this description is much simpler than the previous one.
By "practical" I mean the real flop count measured by executing the code.
Code:
void countFlops() {
    int size = 512;
    int itr = 20;
    float a[size][size];
    float b[size][size];
    float c[size][size];
    /* float d[size];
    float e[size];
    float f[size];
    float g[size];*/
    float r = 3.6541;
    float real_time, proc_time, mflops;
    long long flpops;
    float ireal_time, iproc_time, imflops;
    long long iflpops;
    int retval;
    // Initialise every element of the three matrices.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[i][j] = b[i][j] = c[i][j] = 1.0125;
        }
    }
    /* for (int i = 0; i < size; ++i) {
        d[i]=e[i]=f[i]=g[i]=10.235;
    }*/
    // First call starts the counters.
    if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops))
            < PAPI_OK) {
        printf("Could not initialise PAPI_flops \n");
        printf("Your platform may not support floating point operation event.\n");
        printf("retval: %d\n", retval);
        exit(1);
    }
    // Kernel under measurement.
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; k+=16) {
                c[i][j]=c[i][j]+a[i][k] * b[k][j];
            }
        }
    }
    /* for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                d[k]=d[k]+e[k]+f[k]+g[k]+r;
            }
        }
    }*/
    // Second call reads the counters accumulated since the first call.
    if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops))
            < PAPI_OK) {
        printf("retval: %d\n", retval);
        exit(1);
    }
    string flpops_tmp;
    flpops_tmp = output_formatted_string(flpops);
    printf(
        "calculation: Real_time: %f Proc_time: %f Total flpops: %s MFLOPS: %f\n",
        real_time, proc_time, flpops_tmp.c_str(), mflops);
}
Thank you.
Upvotes: 3
Views: 7822
Reputation: 9199
If you need to count the number of operations, you can write a simple class that acts like a floating point value and gathers statistics. It will be interchangeable with the built-in types.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/operators.hpp>
#include <iostream>
#include <ostream>
#include <utility>
#include <cstddef>
#include <vector>
using namespace boost;
using namespace std;
class Statistic
{
    size_t ops = 0;
public:
    Statistic &increment()
    {
        ++ops;
        return *this;
    }
    size_t count() const
    {
        return ops;
    }
};
template<typename Domain>
class Profiled: field_operators<Profiled<Domain>>
{
    Domain value;
    static vector<Statistic> stat;
    void stat_increment()
    {
        stat.back().increment();
    }
public:
    struct StatisticScope
    {
        StatisticScope()
        {
            stat.emplace_back();
        }
        Statistic &current()
        {
            return stat.back();
        }
        ~StatisticScope()
        {
            stat.pop_back();
        }
    };
    template<typename ...Args>
    Profiled(Args&& ...args)
        : value{forward<Args>(args)...}
    {}
    Profiled& operator+=(const Profiled& x)
    {
        stat_increment();
        value+=x.value;
        return *this;
    }
    Profiled& operator-=(const Profiled& x)
    {
        stat_increment();
        value-=x.value;
        return *this;
    }
    Profiled& operator*=(const Profiled& x)
    {
        stat_increment();
        value*=x.value;
        return *this;
    }
    Profiled& operator/=(const Profiled& x)
    {
        stat_increment();
        value/=x.value;
        return *this;
    }
};
template<typename Domain>
vector<Statistic> Profiled<Domain>::stat{1};
int main()
{
    typedef Profiled<double> Float;
    {
        Float::StatisticScope s;
        Float x = 1.0, y = 2.0, res = 0.0;
        res = x+y*x+y;
        cout << s.current().count() << endl;
    }
    {
        using namespace numeric::ublas;
        Float::StatisticScope s;
        matrix<Float> x{10, 20},y{20,5},res{10,5};
        res = prod(x,y);
        cout << s.current().count() << endl;
    }
}
Output is:
3
2000
P.S. Your matrix loop is not cache-friendly and, as a result, is very inefficient.
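A minimal sketch of what a more cache-friendly loop order could look like, assuming row-major storage in a flat std::vector. The names Matrix, multiply_ijk and multiply_ikj are illustrative and do not come from the question.

```cpp
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored contiguously; element (i, j) is at i * n + j.
using Matrix = std::vector<float>;

// Naive i-j-k order: the innermost loop reads b with a stride of n elements.
void multiply_ijk(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// i-k-j order: the innermost loop walks rows of b and c with unit stride.
void multiply_ikj(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions accumulate into c, so c should be zero-initialised before the call. The operation count is the same 2n^3 in both versions; only the memory access pattern changes.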
P.P.S
int size = 512;
float a[size][size];
This is not legal C++ code. Standard C++ does not support variable-length arrays (VLAs); some compilers accept them only as an extension.
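One possible standard C++ replacement, as a sketch only: SquareMatrix is a hypothetical helper written for this answer, not something from the question or from any library. It keeps the data on the heap in a std::vector, which also avoids putting three 1 MiB arrays on the stack.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a square matrix whose size is chosen at run time,
// stored contiguously in row-major order inside a std::vector.
class SquareMatrix {
public:
    explicit SquareMatrix(std::size_t n, float init = 0.0f)
        : n_(n), data_(n * n, init) {}

    // Element access by (row, column).
    float& operator()(std::size_t i, std::size_t j) { return data_[i * n_ + j]; }
    float operator()(std::size_t i, std::size_t j) const { return data_[i * n_ + j]; }

    std::size_t size() const { return n_; }

private:
    std::size_t n_;
    std::vector<float> data_;
};
```

With this, the declarations in the question would become something like SquareMatrix a(size, 1.0125f); and the kernel would index with a(i, k) instead of a[i][k].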
Upvotes: 1