Reputation: 823
My code runs a long-running computation and then a latency-critical task. I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
#include <stdio.h>
#include <stdint.h>

#define LEN 128
#define USELESS 1000000000
//#define USELESS 0

// Read timestamp counter
static inline long long get_cycles()
{
    unsigned low, high;
    unsigned long long val;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    val = high;
    val = (val << 32) | low;
    return val;
}

// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
    uint32_t ret = 0;
    for(int i = 0; i < n; i++) {
        ret = (ret + (324723947 + arr[i])) ^ 93485734985;
    }
    return ret;
}

int main()
{
    uint32_t sum = 0;   // For adding dependencies
    uint32_t arr[LEN];  // We'll compute the hash of this array

    for(int iter = 0; iter < 3; iter++) {
        // Create a new array to hash for this iteration
        for(int i = 0; i < LEN; i++) {
            arr[i] = (iter + i);
        }

        // Do intense computation
        for(int useless = 0; useless < USELESS; useless++) {
            sum += (sum + useless) * (sum + useless);
        }

        // Do the latency-critical task
        long long start_cycles = get_cycles() + (sum & 1);
        sum += hash(arr, LEN);
        long long end_cycles = get_cycles() + (sum & 1);

        printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
    }
}
When compiled with -O3 and USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, even the zeroth iteration of the latency-critical task is fast, so I don't think a cold cache or a cold branch predictor is the cause.
Upvotes: 2
Views: 105
Reputation: 32732
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower-level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
Upvotes: 1