einpoklum

Reputation: 131920

How not to mess up the cache when working on some long vectors in memory?

Premise

I want to do some kind of computation involving k long vectors of data (each of length n), which I receive in main memory, and write some kind of result back to main memory. For simplicity suppose also that the computation is merely

for(i = 0; i < n; i++)
    v_out[i] = foo(v_1[i],v_2[i], ... ,v_k[i])

or perhaps

for(i = 0; i < n; i++)
    v_out[i] = bar_k( ... bar_2(bar_1(v_1[i]), v_2[i]), ... , v_k[i])

(this is not code, it's pseudocode.) foo() and bar_i() functions have no side-effects. k is constant (known at compile-time), n is known only right before this computation occurs (and it is relatively big - at least a few times larger than the entire L2 cache size and maybe more).

Suppose I am on a single thread on a single core of an x86_64 processor (Intel, or AMD, or what-have-you; the choice does matter, probably). Finally, suppose foo() (resp. bar_i()) is not an intensive computation, i.e. the time to read data from memory and write it back is significant or even dominant relative to the n (resp. k×n) invocations of foo() (resp. bar_i()).

Question

How do I arrange this computation, so as to avoid:

- The data from one input vector clearing out cached data for another vector.
- Input vector data clearing out cached output vector data.
- Under-utilization of memory bandwidth.
- Core idle time, to the extent possible.
- Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).

and to ensure:

- Output vector data remaining in registers or L1 cache for as long as we need it (in the bar case).

Notes:

Upvotes: 4

Views: 184

Answers (2)

VAndrei

Reputation: 5580

The data from one input vector clearing out cached data for another vector.

If you are using v_1[i], v_2[i], ..., v_k[i] in the same function call, the input vectors will not clear out each other's cached data. For each element you read, the CPU fetches just one cache line, not the entire vector. So if you read one element from each of the k vectors, you bring in k cache lines in total, one per vector, and subsequent iterations keep reusing those lines until they are exhausted.

Input vector data clearing out cached output vector data.

This is the same situation as above: for the same reason, the input data will not evict the output's cache lines.

Output vector data remaining in registers or L1 cache for as long as we need it (in the bar case).

You could try using the _mm_prefetch intrinsic to fetch the data before you write it.

Under-utilization of memory bandwidth.

To do that you need to maximize the number of full-width transactions: when the CPU fetches a cache line, all of its elements should be used right away. To achieve this you must rearrange your data. I would treat the k vectors as a k × n matrix stored in column-major order, so the k inputs for element i are contiguous.

type* pMat = (type*)aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(type));
// element i of vector j lives at pMat[i * k + j]:
// v_0[i]     == pMat[i * k + 0]
// v_1[i]     == pMat[i * k + 1]
// ...
// v_(k-1)[i] == pMat[i * k + (k - 1)]

That puts the v_0, ..., v_(k-1) elements for a given i into consecutive memory (and potentially into SIMD registers together), so you might get better vectorization.
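A self-contained sketch of this interleaved layout (the element type float, the summation standing in for foo(), and CACHE_LINE_SIZE = 64 are all assumptions; note aligned_alloc requires the size to be a multiple of the alignment):

```cpp
#include <cstddef>
#include <cstdlib>  // aligned_alloc, free

constexpr std::size_t CACHE_LINE_SIZE = 64;  // typical x86_64; an assumption

// The k inputs for element i are contiguous: pMat[i*k + 0] .. pMat[i*k + k-1],
// so every fetched cache line is consumed in full before moving on.
float* make_interleaved(std::size_t n, std::size_t k) {
    // caller must ensure n * k * sizeof(float) is a multiple of the alignment
    return static_cast<float*>(
        aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(float)));
}

void compute(const float* pMat, float* v_out, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < n; i++) {
        float acc = 0.0f;                // stand-in for foo(...)
        for (std::size_t j = 0; j < k; j++)
            acc += pMat[i * k + j];      // all k loads hit adjacent memory
        v_out[i] = acc;
    }
}
```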

core idle time, to the extent possible.

Fewer cache misses and fewer transcendental instructions will lead to less idle time.

Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).

You could reduce the cost of those sequences by prefetching (_mm_prefetch).

Upvotes: 2

Beta Carotin

Reputation: 1679

To reduce the clearing out of cached data, you could rearrange your k vectors into 1 vector holding a structure with k members. That way, the loop accesses those elements in sequence instead of jumping around in memory.

struct VectorData
{
    Type1 Var1;
    Type2 Var2;
    // ...
    TypeK VarK;
};

std::vector<VectorData> v_in;

for (i = 0; i < n; i++){
    v_out[i] = foo(v_in[i].Var1, v_in[i].Var2, ... , v_in[i].VarK);
    // Or just pass the whole element:
    // v_out[i] = foo(v_in[i]);
}
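A compilable sketch of this array-of-structures approach (the member types, k = 3, and the particular foo are placeholders, not a prescription):

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures layout: the k inputs for element i sit together,
// so the loop walks memory strictly sequentially.
struct VectorData {
    float Var1;
    float Var2;
    float Var3;  // k = 3 here, as a placeholder
};

static inline float foo(const VectorData& d) {
    return d.Var1 + d.Var2 + d.Var3;  // stand-in computation
}

std::vector<float> compute(const std::vector<VectorData>& v_in) {
    std::vector<float> v_out(v_in.size());
    for (std::size_t i = 0; i < v_in.size(); i++)
        v_out[i] = foo(v_in[i]);
    return v_out;
}
```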

Upvotes: 1
