theoretical and practical matrix multiplication FLOP

Question

My system:

System specification: Intel Core 2 Duo E4500, 3700g memory, L2 cache 2M, x64, Fedora 17

How I measure flops/MFLOPS

I use the PAPI library (which reads hardware performance counters) to measure the flops and MFLOPS of my code. It returns the real time, the processing time, the total flops, and finally flops / processing time, which gives MFLOPS. The library uses hardware counters to count floating-point instructions or floating-point operations, together with total cycles, to produce the final flops and MFLOPS values.

My computational kernels

I used a three-loop matrix-matrix multiplication (square matrices) and a second kernel with three nested loops that performs some operations on 1D arrays in its innermost loop.

First Kernel MM

    float a[size][size];
    float b[size][size];
    float c[size][size];

    start_calculate_MFlops();

    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; k += 1) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

    stop_calculate_MFlops();

Second kernel with 1d array

    float d[size];
    float e[size];
    float f[size];
    float g[size];
    float r = 3.6541;

    start_calculate_MFlops();

    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                d[k] = d[k] + e[k] + f[k] + g[k] + r;
            }
        }
    }

    stop_calculate_MFlops();

What I know about flops

Matrix-matrix multiplication (MM) does 2 floating-point operations in its innermost loop, and since there are 3 loops that each iterate `size` times, in theory the total is 2*n^3 flops for MM.

In the second kernel there are 3 loops, and the innermost loop does some computation on 1D arrays. There are 4 floating-point operations in this loop, so in theory the total is 4*n^3 flops.

I know that the flop counts calculated above are not exactly what happens on a real machine. On a real machine there are other operations, like loads and stores, which will add to our theoretical flops.

Questions:

When I use 1D arrays, as in the second kernel, the theoretical flop count is the same as, or close to, the count I measure by executing the code. That is, with 1D arrays the flop count equals the number of operations in the innermost loop multiplied by n^3. But when I use my first kernel, MM, which uses 2D arrays, the theoretical count is 2n^3 while the measured value is much higher: about (4 + 2 operations in the innermost loop of the matrix multiplication) * n^3 = 6n^3. I then replaced the matrix multiplication line in the innermost loop with just the code below:
```
A[i][j]++;
```
The theoretical flop count for this code in 3 nested loops is 1 operation * n^3 = n^3. Again, when I ran the code the result was much higher than expected: (2 + 1 operation in the innermost loop) * n^3 = 3*n^3.

Sample results for a matrix of size 512x512:

Real_time: 1.718368 Proc_time: 1.227672 Total flpops: 807,107,072 MFLOPS: 657.429016

Real_time: 3.608078 Proc_time: 3.042272 Total flpops: 807,024,448 MFLOPS: 265.270355

Theoretical flops: 2*512*512*512 = 268,435,456

Measured flops: 807,107,072, which is about 6*512^3 (= 805,306,368)

Sample result for the 1D array operation in 3 nested loops:

Real_time: 1.282257 Proc_time: 1.155990 Total flpops: 536,872,000 MFLOPS: 464.426117

Theoretical flops: 4n^3 = 4*512^3 = 536,870,912

Measured flops: 536,872,000 = 4*512^3 + overhead (other operations?)

I could not find any reason for this behaviour. Is my assumption correct?

I hope this is simpler than my previous description.

By practical I mean the real flop count measured by executing the code.

Code:

    void countFlops() {

    int size = 512;
    int itr = 20;
    float a[size][size];
    float b[size][size];
    float c[size][size];
/*  float d[size];
    float e[size];
    float f[size];
    float g[size];*/
    float r = 3.6541;

    float real_time, proc_time, mflops;
    long long flpops;
    float ireal_time, iproc_time, imflops;
    long long iflpops;
    int retval;

    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            a[j][j] = b[j][j] = c[j][j] = 1.0125;
        }
    }

/*  for (int i = 0; i < size; ++i) {
                d[i]=e[i]=f[i]=g[i]=10.235;
        }*/

    if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops))
            < PAPI_OK) {
        printf("Could not initialise PAPI_flops \n");
        printf("Your platform may not support floating point operation event.\n");
        printf("retval: %d\n", retval);
        exit(1);
    }
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; k+=16) {
                c[i][j]=c[i][j]+a[i][k] * b[k][j];
            }
        }
    }

/*  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        for (int k = 0; k < size; ++k) {
            d[k]=d[k]+e[k]+f[k]+g[k]+r;
        }
    }
    }*/

    if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops))
            < PAPI_OK) {
        printf("retval: %d\n", retval);
        exit(1);
    }
    string flpops_tmp;
    flpops_tmp = output_formatted_string(flpops);
    printf(
            "calculation: Real_time: %f Proc_time: %f Total flpops: %s MFLOPS: %f\n",
            real_time, proc_time, flpops_tmp.c_str(), mflops);

    }

Thank you.
