einpoklum

Reputation: 131920

How not to mess up the cache when working on some long vectors in memory?

Premise

I want to do some kind of computation involving k long vectors of data (each of length n), which I receive in main memory, and write some kind of result back to main memory. For simplicity suppose also that the computation is merely

for(i = 0; i < n; i++)
    v_out[i] = foo(v_1[i],v_2[i], ... ,v_k[i])

or perhaps

for(i = 0; i < n; i++)
    v_out[i] = bar_k( ... bar_2(bar_1(v_1[i]), v_2[i]), ... , v_k[i])

(this is not code, it's pseudocode.) foo() and bar_i() functions have no side-effects. k is constant (known at compile-time), n is known only right before this computation occurs (and it is relatively big - at least a few times larger than the entire L2 cache size and maybe more).

Suppose I am on a single thread on a single core of an x86_64 processor (Intel, or AMD, or what-have-you; the choice does matter, probably). Finally, suppose foo() (resp. bar_i()) is not an intensive computation, i.e. the time to read data from memory and write it back is significant or even dominant relative to the n (resp. k×n) invocations of foo() (resp. bar_i()).

Question

How do I arrange this computation, so as to avoid:

- The data from one input vector clearing out cached data for another vector.
- Input vector data clearing out cached output vector data.
- Under-utilization of memory bandwidth.
- Core idle time, to the extent possible.
- Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).

and to ensure:

- Output vector data remaining in registers or L1 cache for as long as we need it (in the bar case).

Notes:

Upvotes: 4

Views: 184

Answers (2)

VAndrei

Reputation: 5580

The data from one input vector clearing out cached data for another vector.

If you are using v_1[i], v_2[i], ..., v_k[i] in the same function call, the input vectors will not clear out each other's cached data. For each element you read, the CPU fetches just one cache line, not the entire vector. So if you read one element from each of the k vectors, you bring in k cache lines in total, one per vector, and subsequent iterations keep reusing those lines until they are exhausted.

Input vector data clearing out cached output vector data.

This is the same situation as above: for the same reason, the input data will not evict the output's cache lines.

Output vector data remaining in registers or L1 cache for as long as we need it (in the bar case).

You could try using the _mm_prefetch intrinsic to fetch the data before you write it.

Under-utilization of memory bandwidth.

To do that you need to maximize the number of full-width transactions: when the CPU fetches a cache line, all of its elements should be used right away. To achieve this you must rearrange your data. I would treat the k vectors as a k × n matrix stored in column-major order, so the k inputs for element i are contiguous.

type* pMat = (type*)aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(type));
// element i of vector j lives at pMat[i * k + j]:
// v_0[i]     == pMat[i * k + 0]
// v_1[i]     == pMat[i * k + 1]
// ...
// v_(k-1)[i] == pMat[i * k + (k - 1)]

That puts the v_0, ..., v_(k-1) elements for a given i into consecutive memory (and potentially into SIMD registers together), so you might get better vectorization.
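A self-contained sketch of this interleaved layout (the element type float, the summation standing in for foo(), and CACHE_LINE_SIZE = 64 are all assumptions; note aligned_alloc requires the size to be a multiple of the alignment):

```cpp
#include <cstddef>
#include <cstdlib>  // aligned_alloc, free

constexpr std::size_t CACHE_LINE_SIZE = 64;  // typical x86_64; an assumption

// The k inputs for element i are contiguous: pMat[i*k + 0] .. pMat[i*k + k-1],
// so every fetched cache line is consumed in full before moving on.
float* make_interleaved(std::size_t n, std::size_t k) {
    // caller must ensure n * k * sizeof(float) is a multiple of the alignment
    return static_cast<float*>(
        aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(float)));
}

void compute(const float* pMat, float* v_out, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < n; i++) {
        float acc = 0.0f;                // stand-in for foo(...)
        for (std::size_t j = 0; j < k; j++)
            acc += pMat[i * k + j];      // all k loads hit adjacent memory
        v_out[i] = acc;
    }
}
```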

core idle time, to the extent possible.

Fewer cache misses and fewer transcendental instructions will lead to less idle time.

Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).

You could reduce the cost of those sequences by prefetching (_mm_prefetch).

Upvotes: 2

Beta Carotin

Reputation: 1679

To reduce the clearing out of cached data, you could rearrange your k vectors into 1 vector holding a structure with k members. That way, the loop accesses those elements in sequence instead of jumping around in memory.

struct VectorData
{
    Type1 Var1;
    Type2 Var2;
    // ...
    TypeK VarK;
};

std::vector<VectorData> v_in;

for (i = 0; i < n; i++){
    v_out[i] = foo(v_in[i].Var1, v_in[i].Var2, ... , v_in[i].VarK);
    // Or just pass the whole element:
    // v_out[i] = foo(v_in[i]);
}
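A compilable sketch of this array-of-structures approach (the member types, k = 3, and the particular foo are placeholders, not a prescription):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures layout: the k inputs for element i sit together,
// so the loop walks memory strictly sequentially.
struct VectorData {
    float Var1;
    float Var2;
    float Var3;  // k = 3 here, as a placeholder
};

static inline float foo(const VectorData& d) {
    return d.Var1 + d.Var2 + d.Var3;  // stand-in computation
}

std::vector<float> compute(const std::vector<VectorData>& v_in) {
    std::vector<float> v_out(v_in.size());
    for (std::size_t i = 0; i < v_in.size(); i++)
        v_out[i] = foo(v_in[i]);
    return v_out;
}
```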

Upvotes: 1
