Reputation: 15849
I am working with a program that uses 16-byte 4x4 one-byte matrices:
unsigned char matrix[4][4];
and a few 256-byte 16x16 one-byte matrices:
unsigned char bigMatrix[16][16];
Very often, due to data manipulation, I am forced to loop column-wise through these matrices, which causes cache misses.
Will performance improve if I use flat arrays instead, i.e.
unsigned char matrix[16];
unsigned char matrix[256];
and access the elements with computed indices, i.e.
matrix[variableA*variableB + i];
where variableA*variableB+i needs to be recalculated every time I want to access an element.
I only want speed optimization; memory is not a problem. Will this help, i.e. give some performance gain or loss, or is the difference too small to even care about?
Upvotes: 4
Views: 4530
Reputation: 71120
jalf is mostly right. L1 cache is divided up into chunks; the size of the chunks is dependent on the processor, but is on the order of 32 bytes. So, if you were stepping through memory a byte at a time, you'd get a cache miss every 32 bytes (or whatever the chunk size is). Now, the Intel chip is quite clever here and can detect sequential reads and prefetch the data, reducing the impact of a cache miss.
The 4x4 matrix is highly likely to reside in a single L1 chunk (or cache line), so accessing it by row or by column makes little difference. Of course, you don't want to split the matrix across two cache lines, so well-aligned memory is important.
The 16x16 matrix, however, won't fit in a cache line. So, if you're skipping through the array processing columns, you're going to get lots of cache misses. The computation of the index, as jalf said, makes little difference, as the ratio between CPU and memory speed is high (i.e. you can do a lot of CPU work for each cache miss).
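To see the effect for yourself, here is a rough timing sketch (the 4096x4096 size, the names and the timing harness are mine, purely for illustration; a 16x16 matrix is too small to show the difference):
#include <cstdio>
#include <ctime>

const int N = 4096;                        // big enough that one row no longer fits in L1
static unsigned char m[N][N];

int main() {
    long sum = 0;

    std::clock_t t0 = std::clock();
    for (int row = 0; row != N; ++row)     // row-wise: sequential, prefetch-friendly
        for (int col = 0; col != N; ++col)
            sum += m[row][col];

    std::clock_t t1 = std::clock();
    for (int col = 0; col != N; ++col)     // column-wise: strides N bytes per access
        for (int row = 0; row != N; ++row)
            sum += m[row][col];

    std::clock_t t2 = std::clock();
    std::printf("row-wise:    %ld ticks\n", (long)(t1 - t0));
    std::printf("column-wise: %ld ticks\n", (long)(t2 - t1));
    return (int)(sum & 1);                 // use sum so the loops aren't optimised away
}
The column-wise loop touches a new cache line on almost every access, so on most machines it typically runs several times slower.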
Now, if you are mainly processing the matrix in a column-orientated way, then your best option is to transpose all your matrices (swap rows with columns). Your memory accesses will then be more sequential, the number of cache misses will be reduced, and the CPU will be able to prefetch data better. So, instead of organising the matrix like this:
0 1 2 .... 15
16 17 18 .... 31
....
240 241 242 .... 255
where the number is the memory offset from the start of the matrix, organise thus:
0 16 32 ... 240
1 17 33 ... 241
...
15 31 47 ... 255
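A transpose helper for the 16x16 case could look like this (the function name is mine, just a sketch):
void transpose16(const unsigned char in[16][16], unsigned char out[16][16]) {
    for (int row = 0; row != 16; ++row)
        for (int col = 0; col != 16; ++col)
            out[col][row] = in[row][col];  // column 'col' of 'in' becomes row 'col' of 'out'
}
After this, what used to be column c of the original matrix is the contiguous 16-byte run out[c][0..15], which the cache and the prefetcher handle well.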
Upvotes: 7
Reputation: 169793
Very often due to data manipulation I am forced to loop column-wise [...]
You can't have it both ways: either row-wise or column-wise looping will lead to cache misses if the matrix is 'big enough' (see Skizz' answer). Optimise for the type of looping which gets executed more often.
If memory consumption is not an issue, you might also consider saving both the matrix and its transpose.
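A sketch of that idea (all names are mine, and it assumes writes are much rarer than reads):
struct DualMatrix {
    unsigned char byRow[16][16];  // rows contiguous: fast row-wise loops
    unsigned char byCol[16][16];  // transposed copy: fast column-wise loops

    void set(int row, int col, unsigned char v) {
        byRow[row][col] = v;      // every write updates both copies
        byCol[col][row] = v;
    }
    unsigned char get(int row, int col) const { return byRow[row][col]; }
};
Every write costs double, which is why this only pays off when reads dominate.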
Upvotes: 0
Reputation: 1267
When I was in school, one of my CS teachers insisted that using a single-dimensional array would be faster. That day I was very annoyed...
Upvotes: 1
Reputation: 41519
While the compiled code will be equally fast either way, there is a design issue: reuse of the indexing code could be maximized.
The best way to do so, imho, is wrapping it up in a container that knows how to loop over its elements in the fastest way. There is a name for this: an 'internal iterator', as mentioned in the GoF Design Patterns "Iterator" pattern.
A brief example:
#include <cassert>

template< int N >
struct CNxN {
    typedef int t_row[N];
    typedef t_row t_matrix[N];
    t_matrix m_Contents;

    template< typename Functor >
    void each( Functor & f ) {
        // Row in the outer loop, column in the inner loop:
        // this walks memory sequentially for the row-major layout.
        for( int row = 0; row != N; ++row )
            for( int col = 0; col != N; ++col )
                f( m_Contents[row][col] );
    }
};

// client code
struct sum {
    long result;
    sum() : result(0) {}
    void operator()( int i ) { result += i; }
};

int main() {
    CNxN<3> matrix = { { {1,1,1}, {1,1,1}, {1,1,1} } };
    sum s;
    matrix.each( s );
    assert( s.result == 9 );
    // ...and it has performed in the fastest possible way ;)
}
Upvotes: 2
Reputation: 5449
The big linear array can be slightly faster for sequential access because you save a multiply operation at every index. If you're looping column-wise (column index varying fastest) then you are accessing sequentially, at least in [row][col] notation, which has been "standard" with everyone I've ever spoken to.
I doubt your 256-element array will cause cache misses on modern hardware, but I am willing to be proven wrong. What does cachegrind tell you?
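For reference, a typical run looks like this (./your_program is a placeholder; cachegrind writes its report to a file named cachegrind.out.<pid>):
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>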
Upvotes: 0
Reputation: 114815
You say that variableA*variableB+i needs to be recalculated every time you access an element. Well, this happens in any case, even when using multidimensional arrays. The only difference is that with a multidimensional array the compiler emits this code for you, so you don't see it, while with the one-dimensional array the calculation is visible in your source.
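To make that concrete, here is a small check (the indices are arbitrary; the point is that both forms denote the same address, so the same multiply-and-add is emitted for both):
#include <cassert>

int main() {
    unsigned char matrix[16][16];
    unsigned char *flat = &matrix[0][0];
    int row = 3, col = 7;
    // matrix[row][col] is defined as *(&matrix[0][0] + row*16 + col),
    // so indexing the 2D array performs the same arithmetic; the
    // compiler simply emits it for you.
    assert(&matrix[row][col] == &flat[row * 16 + col]);
    return 0;
}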
Upvotes: 1
Reputation: 248269
It makes no difference. The data is laid out in the exact same way in either case, and is accessed in the same way too. I'd be surprised if it didn't generate exactly the same assembly, even.
However, with a 256-byte table you're unlikely to get cache misses anyway. The CPU's L1 cache is typically between 32 and 128 KB, so I doubt you're seeing many misses at all.
Upvotes: 17