Amir Ofir
Amir Ofir

Reputation: 136

Hitting CPU cache for better performance in C++

Is working on continuous buffers preferred for CPU caching?

I am trying to write an application that does several operations on image (some very local such as shifting, derivatives and some between results like subtraction). I have a large buffer for the results (each calculation has the shape of an image, so I start with allocation of X * image shape bytes)

What should I do for maximizing the CPU cache hits?

Upvotes: 1

Views: 1000

Answers (2)

Moshe Gottlieb
Moshe Gottlieb

Reputation: 4003

Sure, working on a continuous array is more likely to hit cache.
You may want to arrange your data in a continuous way:
For instance:

#include <iostream>
#include <cstdint>

const int SIZE = 3;

int main(){
    uint8_t buffer_2d[SIZE][SIZE];
    uint8_t* buffer_1d = reinterpret_cast<uint8_t*>(buffer_2d); // or just do uint8_t buffer_1d[SIZE*SIZE];
    const auto base = &(buffer_2d[0][0]);
    for (int y=0;y<SIZE;++y){
        for (int x=0;x<SIZE;++x){
            std::cout << "x: " << x << " y: " << y << " offset for [x][y]: " << &(buffer_2d[x][y]) - base << " offset for [y][x]: " << &(buffer_2d[y][x]) - &(buffer_2d[0][0]) << " offset for [y*SIZE+x]: " << &(buffer_1d[y*SIZE+x]) - base <<  std::endl;
        }
    }
    return 0;
}

Treating the array as a human natural [x][y] array would not be efficient, as the data is not aligned this way, the efficient approach would be to use [y][x] or work on the array as a single dimension array and treat the index as y*LINE_SIZE + x.
Here is the output for this test showing exactly that:

x: 0 y: 0 offset for [x][y]: 0 offset for [y][x]: 0 offset for [y*SIZE+x]: 0
x: 1 y: 0 offset for [x][y]: 3 offset for [y][x]: 1 offset for [y*SIZE+x]: 1
x: 2 y: 0 offset for [x][y]: 6 offset for [y][x]: 2 offset for [y*SIZE+x]: 2
x: 0 y: 1 offset for [x][y]: 1 offset for [y][x]: 3 offset for [y*SIZE+x]: 3
x: 1 y: 1 offset for [x][y]: 4 offset for [y][x]: 4 offset for [y*SIZE+x]: 4
x: 2 y: 1 offset for [x][y]: 7 offset for [y][x]: 5 offset for [y*SIZE+x]: 5
x: 0 y: 2 offset for [x][y]: 2 offset for [y][x]: 6 offset for [y*SIZE+x]: 6
x: 1 y: 2 offset for [x][y]: 5 offset for [y][x]: 7 offset for [y*SIZE+x]: 7
x: 2 y: 2 offset for [x][y]: 8 offset for [y][x]: 8 offset for [y*SIZE+x]: 8

the last two results are using exactly the same semantics, the first one will let the compiler emit the calculation, but performance should be the same.

Additionally, once your data is arranged correctly, depending on what you do with the data, you may want to use OpenCL or something and utilize the GPU or SIMD, may lead significant performance improvements if can be expressed in SIMD (Single Instruction Multiple Data) code.

Upvotes: 1

#ifdef _MSC_VER
        _declspec(align(64))    unsigned char  block_hashfp;
#else
        __attribute__((aligned(64))) unsigned char  block_hashfp;
#endif

will hit l2 cache

Upvotes: 0

Related Questions