Poor maths performance in C vs Python/numpy

Question

Near-duplicate / related:

How does BLAS get such extreme performance? (If you want fast matmul in C, seriously just use a good BLAS library unless you want to hand-tune your own asm version.) But that doesn't mean it's not interesting to see what happens when you compile less-optimized matrix code.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
Matrix Multiplication with blocks

Out of interest, I decided to compare the performance of (inexpertly) handwritten C vs. Python/numpy performing a simple matrix multiplication of two, large, square matrices filled with random numbers from 0 to 1.

I found that python/numpy outperformed my C code by over 10,000x This is clearly not right, so what is wrong with my C code that is causing it to perform so poorly? (even compiled with -O3 or -Ofast)

The python:

import time
import numpy as np

t0 = time.time()
m1 = np.random.rand(2000, 2000)
m2 = np.random.rand(2000, 2000)
t1 = time.time()
m3 = m1 @ m2
t2 = time.time()
print('creation time: ', t1 - t0, ' 
 multiplication time: ', t2 - t1)

The C:

#include 
#include 
#include 

int main(void) {

    clock_t t0=clock(), t1, t2;

    // create matrices and allocate memory  
    int m_size = 2000;
    int i, j, k;
    double running_sum;
    double *m1[m_size], *m2[m_size], *m3[m_size];
    double f_rand_max = (double)RAND_MAX;
    for(i = 0; i < m_size; i++) {
        m1[i] = (double *)malloc(sizeof(double)*m_size);
        m2[i] = (double *)malloc(sizeof(double)*m_size);
        m3[i] = (double *)malloc(sizeof(double)*m_size);
    }
    // populate with random numbers 0 - 1
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++) {
            m1[i][j] = (double)rand() / f_rand_max;
            m2[i][j] = (double)rand() / f_rand_max;
        }
    t1 = clock();

    // multiply together
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++) {
            running_sum = 0;
            for (k = 0; k < m_size; k++)
                running_sum += m1[i][k] * m2[k][j];
            m3[i][j] = running_sum;
        }

    t2 = clock();

    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
    printf("creation time: %f", t01 );
    printf("
multiplication time: %f", t12 );
    return 0;
}

EDIT: Have corrected the python to do a proper dot product which closes the gap a little and the C to time with a resolution of microseconds and use the comparable double data type, rather than float, as originally posted.

Outputs:

$ gcc -O3 -march=native bench.c
$ ./a.out
creation time: 0.092651
multiplication time: 139.945068
$ python3 bench.py
creation time: 0.1473407745361328
multiplication time: 0.329038143157959

It has been pointed out that the naive algorithm implemented here in C could be improved in ways that lend themselves to make better use of compiler optimisations and the cache.

EDIT: Having modified the C code to transpose the second matrix in order to achieve a more efficient access pattern, the gap closes more

The modified multiplication code:

// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i=0; i < m_size; i++)
    for (j=0; j < m_size; j++)
        m3[j][i] = m2[i][j];
// swap the pointers
void *mtemp = *m3;
*m3 = *m2;
*m2 = mtemp;


// multiply together
for (i=0; i < m_size; i++)
    for (j=0; j < m_size; j++) {
        running_sum = 0;
        for (k = 0; k < m_size; k++)
            running_sum += m1[i][k] * m2[j][k];
        m3[i][j] = running_sum;
    }

The results:

$ gcc -O3 -march=native bench2.c
$ ./a.out
creation time: 0.107767
multiplication time: 10.843431
$ python3 bench.py
creation time: 0.1488208770751953
multiplication time: 0.3335080146789551

EDIT: compiling with -0fast, which I am reassured is a fair comparison, brings down the difference to just over an order of magnitude (in numpy's favour).

$ gcc -Ofast -march=native bench2.c
$ ./a.out
creation time: 0.098201
multiplication time: 4.766985
$ python3 bench.py
creation time:  0.13812589645385742
multiplication time:  0.3441300392150879

EDIT: It was suggested to change indexing from arr[i][j] to arr[i*m_size + j] this yielded a small performance increase:

for m_size = 10000

$ gcc -Ofast -march=native bench3.c # indexed by arr[ i * m_size + j ]
$ ./a.out
creation time: 1.280863
multiplication time: 626.327820
$ gcc -Ofast -march=native bench2.c # indexed by art[I][j]
$ ./a.out
creation time: 2.410230
multiplication time: 708.979980
$ python3 bench.py
creation time:  3.8284950256347656
multiplication time:  39.06089973449707

The up to date code bench3.c:

#include 
#include 
#include 

int main(void) {

    clock_t t0, t1, t2;

    t0 = clock();
    // create matrices and allocate memory
    int m_size = 10000;
    int i, j, k, x, y;
    double running_sum;
    double *m1 = (double *)malloc(sizeof(double)*m_size*m_size),
                *m2 = (double *)malloc(sizeof(double)*m_size*m_size),
                *m3 = (double *)malloc(sizeof(double)*m_size*m_size);
    double f_rand_max = (double)RAND_MAX;

    // populate with random numbers 0 - 1
    for (i=0; i < m_size; i++) {
        x = i * m_size;
        for (j=0; j < m_size; j++)
            m1[x + j] = ((double)rand()) / f_rand_max;
          m2[x + j] = ((double)rand()) / f_rand_max;
            m3[x + j] = ((double)rand()) / f_rand_max;
    }
    t1 = clock();

    // transpose m2 in order to capitalise on cache efficiencies
    // store transposed matrix in m3 for now
    for (i=0; i < m_size; i++)
        for (j=0; j < m_size; j++)
            m3[j*m_size + i] = m2[i * m_size + j];
    // swap the pointers
    double *mtemp = m3;
    m3 = m2;
    m2 = mtemp;


    // multiply together
    for (i=0; i < m_size; i++) {
        x = i * m_size;
        for (j=0; j < m_size; j++) {
            running_sum = 0;
            y = j * m_size;
            for (k = 0; k < m_size; k++)
                running_sum += m1[x + k] * m2[y + k];
            m3[x + j] = running_sum;
        }
    }

    t2 = clock();

    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
    printf("creation time: %f", t01 );
    printf("
multiplication time: %f", t12 );
    return 0;
}

ezekiel · Accepted Answer

CONCLUSION: So the original absurd factor of x10,000 difference was largely due to mistakenly comparing element-wise multiplication in Python/numpy to C code and not compiled with all of the available optimisations and written with a highly inefficient memory access pattern that likely didn't utilise the cache. A 'fair' comparison (ie. correct, but highly inefficient single-threaded algorithm, compiled with -Ofast) yields a performance factor difference of x350 A number of simple edits to improve the memory access pattern brought the comparison down to a factor of x16 (in numpy's favour) for large matrix (10000 x 10000) multiplication. Furthermore, numpy automatically utilises all four virtual cores on my machine whereas this C does not, so the performance difference could be a factor of x4 - x8 (depending on how well this program ran on hyperthreading). I consider a factor of x4 - x8 to be fairly sensible, given that I don't really know what I'm doing and just knocked a bit of code together whereas numpy is based on BLAS which I understand has been extensively optimised over the years by experts from all over the place so I consider the question answered/solved.

Poor maths performance in C vs Python/numpy

Answers (1)

Related Questions