Reputation: 15849
I am working with a program that uses 16-byte 4x4 one-byte matrices:
unsigned char matrix[4][4];
and a few 256-byte 16x16 one-byte matrices:
unsigned char bigMatrix[16][16];
Very often, due to data manipulation, I am forced to loop column-wise through these matrices, which causes cache misses.
Will performance improve if I use flat arrays instead, i.e.
unsigned char matrix[16];
unsigned char matrix[256];
and access the elements with computed indices, i.e.
matrix[variableA*variableB + i];
where variableA*variableB+i needs to be recalculated every time I want to access an element.
I only want speed optimization; memory is not a problem. Will this help, i.e. give some performance gain or loss, or is the difference too small to even care about?
Upvotes: 4
Views: 4530
Reputation: 71120
jalf is mostly right. L1 cache is divided up into chunks; the size of the chunks is dependent on the processor, but is on the order of 32 bytes. So, if you were stepping through memory a byte at a time, you'd get a cache miss every 32 bytes (or whatever the chunk size is). Now, the Intel chip is quite clever here and can detect sequential reads and prefetch the data, reducing the impact of a cache miss.
The 4x4 matrix is highly likely to reside in a single L1 chunk (or cache line), so accessing it by row or by column makes little difference. Of course, you don't want to split the matrix across two cache lines, so well-aligned memory is important.
The 16x16 matrix, however, won't fit in a cache line. So, if you're skipping through the array processing columns, you're going to get lots of cache misses. The computation of the index, as jalf said, makes little difference, as the ratio between CPU and memory speed is high (i.e. you can do a lot of CPU work for each cache miss).
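To see the effect for yourself, here is a rough timing sketch (the 4096x4096 size, the names and the timing harness are mine, purely for illustration; a 16x16 matrix is too small to show the difference):
#include <cstdio>
#include <ctime>

const int N = 4096;                        // big enough that one row no longer fits in L1
static unsigned char m[N][N];

int main() {
    long sum = 0;

    std::clock_t t0 = std::clock();
    for (int row = 0; row != N; ++row)     // row-wise: sequential, prefetch-friendly
        for (int col = 0; col != N; ++col)
            sum += m[row][col];

    std::clock_t t1 = std::clock();
    for (int col = 0; col != N; ++col)     // column-wise: strides N bytes per access
        for (int row = 0; row != N; ++row)
            sum += m[row][col];

    std::clock_t t2 = std::clock();
    std::printf("row-wise:    %ld ticks\n", (long)(t1 - t0));
    std::printf("column-wise: %ld ticks\n", (long)(t2 - t1));
    return (int)(sum & 1);                 // use sum so the loops aren't optimised away
}
The column-wise loop touches a new cache line on almost every access, so on most machines it typically runs several times slower.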
Now, if you are mainly processing the matrix in a column-orientated way, then your best option is to transpose all your matrices (swap rows with columns). Your memory accesses will then be more sequential, the number of cache misses will be reduced, and the CPU will be able to prefetch data better. So, instead of organising the matrix like this:
0 1 2 .... 15
16 17 18 .... 31
....
240 241 242 .... 255
where the number is the memory offset from the start of the matrix, organise thus:
0 16 32 ... 240
1 17 33 ... 241
...
15 31 47 ... 255
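A transpose helper for the 16x16 case could look like this (the function name is mine, just a sketch):
void transpose16(const unsigned char in[16][16], unsigned char out[16][16]) {
    for (int row = 0; row != 16; ++row)
        for (int col = 0; col != 16; ++col)
            out[col][row] = in[row][col];  // column 'col' of 'in' becomes row 'col' of 'out'
}
After this, what used to be column c of the original matrix is the contiguous 16-byte run out[c][0..15], which the cache and the prefetcher handle well.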
Upvotes: 7
Reputation: 169793
Very often due to data manipulation I am forced to loop column-wise [...]
You can't have it both ways: either row-wise or column-wise looping will lead to cache misses if the matrix is 'big enough' (see Skizz' answer). Optimise for the type of looping which gets executed more often.
If memory consumption is not an issue, you might also consider saving both the matrix and its transpose.
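A sketch of that idea (all names are mine, and it assumes writes are much rarer than reads):
struct DualMatrix {
    unsigned char byRow[16][16];  // rows contiguous: fast row-wise loops
    unsigned char byCol[16][16];  // transposed copy: fast column-wise loops

    void set(int row, int col, unsigned char v) {
        byRow[row][col] = v;      // every write updates both copies
        byCol[col][row] = v;
    }
    unsigned char get(int row, int col) const { return byRow[row][col]; }
};
Every write costs double, which is why this only pays off when reads dominate.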
Upvotes: 0
Reputation: 1267
When I was in school, one of my CS teachers insisted that using a single-dimensional array would be faster. That day I was very annoyed...
Upvotes: 1
Reputation: 41519
While the compiled code will be equally fast either way, there is a design issue: reuse of the indexing code could be maximized.
The best way to do so, imho, is wrapping it up in a container that knows how to loop over its elements in the fastest way. There is a name for this: an 'internal iterator', as mentioned in the GoF Design Patterns "Iterator" pattern.
A brief example:
#include <cassert>

template< int N >
struct CNxN {
    typedef int t_row[N];
    typedef t_row t_matrix[N];
    t_matrix m_Contents;

    template< typename Functor >
    void each( Functor & f ) {
        // Row in the outer loop, column in the inner loop:
        // this walks memory sequentially for the row-major layout.
        for( int row = 0; row != N; ++row )
            for( int col = 0; col != N; ++col )
                f( m_Contents[row][col] );
    }
};

// client code
struct sum {
    long result;
    sum() : result(0) {}
    void operator()( int i ) { result += i; }
};

int main() {
    CNxN<3> matrix = { { {1,1,1}, {1,1,1}, {1,1,1} } };
    sum s;
    matrix.each( s );
    assert( s.result == 9 );
    // ...and it has performed in the fastest possible way ;)
}
Upvotes: 2
Reputation: 5449
The big linear array can be slightly faster for sequential access because you save a multiply operation at every index. If you're looping column-wise (column index varying fastest) then you are accessing sequentially, at least in [row][col] notation, which has been "standard" with everyone I've ever spoken to.
I doubt your 256-element array will cause cache misses on modern hardware, but I am willing to be proven wrong. What does cachegrind tell you?
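For reference, a typical run looks like this (./your_program is a placeholder; cachegrind writes its report to a file named cachegrind.out.<pid>):
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>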
Upvotes: 0
Reputation: 114815
You say that variableA*variableB+i needs to be recalculated every time you access an element. Well, this happens in any case, even when using multidimensional arrays. The only difference is that with a multidimensional array the compiler emits this code for you, so you don't see it, while with the one-dimensional array the calculation is visible in your source.
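To make that concrete, here is a small check (the indices are arbitrary; the point is that both forms denote the same address, so the same multiply-and-add is emitted for both):
#include <cassert>

int main() {
    unsigned char matrix[16][16];
    unsigned char *flat = &matrix[0][0];
    int row = 3, col = 7;
    // matrix[row][col] is defined as *(&matrix[0][0] + row*16 + col),
    // so indexing the 2D array performs the same arithmetic; the
    // compiler simply emits it for you.
    assert(&matrix[row][col] == &flat[row * 16 + col]);
    return 0;
}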
Upvotes: 1
Reputation: 248269
It makes no difference. The data is laid out in the exact same way in either case, and is accessed in the same way too. I'd be surprised if it didn't generate exactly the same assembly, even.
However, with a 256-byte table you're unlikely to get cache misses anyway. The CPU's L1 cache is typically between 32 and 128 KB, so I doubt you're seeing many misses at all.
Upvotes: 17