Reputation: 823
My code runs a long-running computation and then a latency-critical task. I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
#include <stdio.h>
#include <stdint.h>

#define LEN 128
#define USELESS 1000000000
//#define USELESS 0

// Read timestamp counter
static inline long long get_cycles()
{
    unsigned low, high;
    unsigned long long val;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    val = high;
    val = (val << 32) | low;
    return val;
}

// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
    uint32_t ret = 0;
    for(int i = 0; i < n; i++) {
        ret = (ret + (324723947 + arr[i])) ^ 93485734985;
    }
    return ret;
}

int main()
{
    uint32_t sum = 0;   // For adding dependencies
    uint32_t arr[LEN];  // We'll compute the hash of this array

    for(int iter = 0; iter < 3; iter++) {
        // Create a new array to hash for this iteration
        for(int i = 0; i < LEN; i++) {
            arr[i] = (iter + i);
        }

        // Do intense computation
        for(int useless = 0; useless < USELESS; useless++) {
            sum += (sum + useless) * (sum + useless);
        }

        // Do the latency-critical task
        long long start_cycles = get_cycles() + (sum & 1);
        sum += hash(arr, LEN);
        long long end_cycles = get_cycles() + (sum & 1);

        printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
    }
}
When compiled with -O3 and USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, even the zeroth iteration of the latency-critical task is fast, so I don't think a cold cache or a cold branch predictor is the cause.
Upvotes: 2
Views: 105
Reputation: 32732
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower-level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
Upvotes: 1