Multiplying large matrices is much slower with contiguous memory allocation

Question

While implementing a neural network, I noticed that if I allocate memory as a single contiguous block for the data set arrays, execution time increases several times.

Compare these two methods of memory allocation:

float** alloc_2d_float(int rows, int cols, int contiguous)
{
    int i;
    float** array = malloc(rows * sizeof(float*));

    if(contiguous)
    {
        float* data = malloc(rows*cols*sizeof(float));
        assert(data && "Can't allocate contiguous memory");

        for(i=0; i



Here are the results when compiling with -march=native -Ofast (tried gcc and clang):

michael@Pascal:~/NN$ time ./test 300 1 0

Multiplying (100000, 1000) and (300, 1000) arrays 1 times, noncontiguous memory allocation.

Allocating memory:    0.2 seconds
Initializing arrays: 0.8 seconds
Dot product:         3.3 seconds

real    0m4.296s
user    0m4.108s
sys     0m0.188s

michael@Pascal:~/NN$ time ./test 300 1 1

Multiplying (100000, 1000) and (300, 1000) arrays 1 times, contiguous memory allocation.

Allocating memory:    0.0 seconds
Initializing arrays: 40.3 seconds
Dot product:         13.5 seconds    

real    0m53.817s
user    0m4.204s
sys     0m49.664s


Here's the code:
https://github.com/michaelklachko/NN/blob/master/test.c

Note that both initializing and dot product are much slower for contiguous memory.

I expected the opposite - a contiguous block of memory should be more cache friendly than a large number of separate small blocks. Or at least they should be similar in performance (this machine has 64GB of RAM, and 90% of it is unused).

EDIT: Here's the compressed self-contained code (I still recommend using the github version instead, which has measuring and formatting statements):

#include 
#include 
#include 

float** alloc_2d_float(int rows, int cols, int contiguous){
    int i;
    float** array = malloc(rows * sizeof(float*));
    if(contiguous){
        float* data = malloc(rows*cols*sizeof(float));
        for(i=0; i

Multiplying large matrices is much slower with contiguous memory allocation

Answers (1)

Related Questions