Reputation: 83
I'm trying to test the performance impact of false sharing. The test code is as follows:
#include <cstdint>
#include <chrono>
#include <iostream>
#include <thread>
#include <windows.h>
using namespace std;

constexpr uint64_t loop = 1000000000;

struct no_padding_struct {
    no_padding_struct() : x(0), y(0) {}
    uint64_t x;
    uint64_t y;
};

struct padding_struct {
    padding_struct() : x(0), y(0) {}
    uint64_t x;
    char padding[64];   // keeps x and y in different cache lines
    uint64_t y;
};

alignas(64) volatile no_padding_struct n;
alignas(64) volatile padding_struct p;

constexpr uint64_t core_a = 0;
constexpr uint64_t core_b = 1;

void func(volatile uint64_t* addr, uint64_t b, uint64_t mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
    for (uint64_t i = 0; i < loop; ++i) {
        *addr += b;
    }
}

void test1(uint64_t a, uint64_t b) {
    thread t1{ func, &n.x, a, 1ull << core_a };
    thread t2{ func, &n.y, b, 1ull << core_b };
    t1.join();
    t2.join();
}

void test2(uint64_t a, uint64_t b) {
    thread t1{ func, &p.x, a, 1ull << core_a };
    thread t2{ func, &p.y, b, 1ull << core_b };
    t1.join();
    t2.join();
}

int main() {
    uint64_t a, b;
    cin >> a >> b;      // read the increments at runtime so they can't be constant-folded
    auto start = std::chrono::system_clock::now();
    //test1(a, b);      // uncomment one test per run
    //test2(a, b);
    auto end = std::chrono::system_clock::now();
    cout << (end - start).count();   // raw system_clock ticks
}
The results were mostly as follows:
x86:
cores   test1 (debug)  test1 (release)  test2 (debug)  test2 (release)
0-0     4.0s           2.8s             4.0s           2.8s
0-1     5.6s           6.1s             3.0s           1.5s
0-2     6.2s           1.8s             2.0s           1.4s
0-3     6.2s           1.8s             2.0s           1.4s
0-5     6.5s           1.8s             2.0s           1.4s

x64:
cores   test1 (debug)  test1 (release)  test2 (debug)  test2 (release)
0-0     2.8s           2.8s             2.8s           2.8s
0-1     4.2s           7.8s             2.1s           1.5s
0-2     3.5s           2.0s             1.4s           1.4s
0-3     3.5s           2.0s             1.4s           1.4s
0-5     3.5s           2.0s             1.4s           1.4s
My CPU is an Intel Core i7-9750H. 'core0' and 'core1' are the two logical cores of one physical core, as are 'core2' and 'core3', and so on. MSVC 14.24 was used as the compiler.
The times recorded are approximate best scores over several runs, since there were tons of background tasks. I think this is fair enough, as the results divide clearly into groups and an error of 0.1s~0.3s does not affect that division.
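As an aside (my own sketch, not from the original code): for interval measurements, steady_clock is usually preferred over system_clock, since the latter can jump if the wall clock is adjusted. A minimal best-of-N harness along the lines described above might look like this (the helper name best_of is hypothetical):

#include <chrono>
#include <cstdio>

template <class F>
double best_of(int runs, F&& f) {
    double best = 1e300;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) best = s;   // keep the best (least-disturbed) run
    }
    return best;
}

// usage: printf("%.2fs\n", best_of(5, [&]{ test1(a, b); }));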
Test2 was quite easy to explain. As x and y are in different cache lines, running on 2 physical cores gives roughly a 2x performance boost (the cost of context switching when running 2 threads on a single core is negligible here). Running on one core with SMT is less efficient than 2 physical cores, limited by the throughput of Coffee Lake (I believe Ryzen would do slightly better), but still more efficient than temporal multithreading. It seems 64-bit mode is also more efficient here.
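(Side note, my addition: since C++17 the cache-line size used for this kind of padding can be spelled portably via std::hardware_destructive_interference_size from <new>, which MSVC supports with /std:c++17; it is 64 on this CPU. A sketch of the padded struct in that style:)

#include <cstdint>
#include <new>

struct padded_pair {
    alignas(std::hardware_destructive_interference_size) std::uint64_t x;
    alignas(std::hardware_destructive_interference_size) std::uint64_t y;
};
// Each member gets its own cache line, so the struct spans at least two lines.
static_assert(sizeof(padded_pair) >= 2 * std::hardware_destructive_interference_size);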
But the result of test1 confuses me. First, in debug mode, 0-2, 0-3, and 0-5 are slower than 0-0, which makes sense: I explained this as the contended cache line bouncing between the two cores' L1 caches (through L3) to keep them coherent, whereas it always stays in one L1 when running on a single core. But this theory conflicts with the fact that the 0-1 pair is always the slowest: technically, the two threads share the same L1 cache, so 0-1 should run about twice as fast as 0-0.
Second, in release mode, 0-2, 0-3, and 0-5 are faster than 0-0, which disproves the theory above.
Last, 0-1 runs slower in release than in debug in both 64-bit and 32-bit mode. That is what I understand least. I read the generated assembly code and did not find anything helpful.
Upvotes: 3
Views: 172
Reputation: 83
@PeterCordes Thank you for your analysis and advice. I finally profiled the program using VTune, and it turns out your expectation was correct.
When running on the two SMT threads of the same core, machine clears consume most of the time, and this is more severe in Release than in Debug. It happens in both 32-bit and 64-bit mode.
When running on different physical cores, the bottleneck is memory (store latency and false sharing), and Release is always faster since it performs significantly fewer memory accesses than Debug in the critical loop, as shown in the Debug assembly (godbolt) and the Release assembly (godbolt). The total number of instructions retired is also lower in Release, which strengthens this point. It seems the assembly I found in Visual Studio yesterday was not correct.
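To illustrate the "fewer memory accesses" point (my own model, not taken from the profile): an unoptimized MSVC Debug build typically keeps locals in their stack slots and reloads/re-stores them around every use, so each iteration touches memory several times beyond the one contended increment. A rough C++-level stand-in, using volatile locals to mimic those stack spills:

#include <cstdint>

constexpr uint64_t loop = 1000000000;

// Release-like: i and b live in registers; one load+store of *addr per iteration.
void func_release_like(volatile uint64_t* addr, uint64_t b) {
    for (uint64_t i = 0; i < loop; ++i)
        *addr += b;
}

// Debug-like model: the volatile locals force the counter and the copy of b
// through memory every iteration, roughly like an unoptimized stack frame.
void func_debug_like(volatile uint64_t* addr, uint64_t b) {
    volatile uint64_t vi = 0;
    volatile uint64_t vb = b;
    while (vi < loop) {
        *addr += vb;
        vi = vi + 1;
    }
}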
Upvotes: 1
Reputation: 4615
This might be explained by hyper-threading. Cores being shared as 2 hyperthread cores do not get double the throughout like 2 entirely separate cores might. Instead you might get something like 1.7 times the performance.
Indeed, your processor has 6 cores and 12 threads, and core0/core1 are 2 threads on the same underlying core, if I am reading all this correctly.
In fact, if you picture in your mind how hyper-threading works, with the work of 2 separate cores interleaved, it is not surprising.
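For reference (my addition): the logical-processor count is easy to check from standard C++, though the logical-to-physical mapping itself is OS-specific:

#include <iostream>
#include <thread>

int main() {
    // Reports logical processors (may be 0 if unknown);
    // 12 on an i7-9750H: 6 cores x 2 SMT threads.
    std::cout << std::thread::hardware_concurrency() << '\n';
}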
Upvotes: -1