iliar

Reputation: 952

Please explain cache coherence

I've recently learned about false sharing, which as I understand it stems from the CPU's effort to maintain cache coherence between different cores. However, doesn't the following example demonstrate that cache coherence is violated?

The example below launches several threads that increase a global variable x, several threads that assign the value of x to y, and an observer that tests whether y > x. The condition y > x should never occur if there were memory coherence between the cores, since y is only increased after x has been increased. However, this condition does occur according to the results of running this program. I tested it in Visual Studio, both x64 and x86, both Debug and Release, with pretty much the same results.

So, does memory coherence only happen when it's bad and never when it's good? :) Please explain how cache coherence works and how it doesn't work. If you can point me to a book that explains the subject, I'll be grateful.

Edit: I've added mfence wherever possible, and still there is no memory coherence (presumably due to stale cache). Also, I know the program has a data race; that's the whole point. My question is: why is there a data race if the CPU maintains cache coherence? (If it weren't maintaining cache coherence, then what is false sharing and how does it happen?) Thank you.

#include <intrin.h>
#include <windows.h>

#include <iostream>
#include <thread>
#include <atomic>
#include <list>
#include <chrono>
#include <ratio>

#define N 1000000

#define SEPARATE_CACHE_LINES 0
#define USE_ATOMIC 0

#pragma pack(1)
struct
{
    __declspec (align(64)) volatile long x;  // x starts on its own cache line
#if SEPARATE_CACHE_LINES
    __declspec (align(64))                   // optionally push y onto a separate line
#endif
        volatile long y;                     // otherwise y shares x's cache line
} data;

volatile long &g_x = data.x;
volatile long &g_y = data.y;

int g_observed;
std::atomic<bool> g_start;

void Observer()
{
    while (!g_start);                  // spin until the main thread releases everyone
    for (int i = 0; i < N; ++i)
    {
        _mm_mfence();
        long y = g_y;                  // sample y first...
        _mm_mfence();
        long x = g_x;                  // ...then x
        _mm_mfence();
        if (y > x)                     // count the "impossible" observations
        {
            ++g_observed;
        }
    }
}

void XIncreaser()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        InterlockedAdd(&g_x, 1);
#else
        // Non-atomic read-modify-write: two threads can read the same
        // value and both store value + 1, losing increments.
        _mm_mfence();
        long x = g_x + 1;
        _mm_mfence();
        g_x = x;
        _mm_mfence();
#endif
    }
}

void YAssigner()
{
    while (!g_start);
    for (int i = 0; i < N; ++i)
    {
#if USE_ATOMIC
        long x = g_x;
        InterlockedExchange(&g_y, x);
#else
        _mm_mfence();
        long x = g_x;                  // snapshot x
        _mm_mfence();
        g_y = x;                       // publish it to y
        _mm_mfence();
#endif
    }
}

int main()
{
    using namespace std::chrono;

    g_x = 0;
    g_y = 0;
    g_observed = 0;
    g_start = false;

    const int NAssigners = 4;
    const int NIncreasers = 4;

    std::list<std::thread> threads;

    for (int i = 0; i < NAssigners; ++i)
    {
        threads.emplace_back(YAssigner);
    }
    for (int i = 0; i < NIncreasers; ++i)
    {
        threads.emplace_back(XIncreaser);
    }
    threads.emplace_back(Observer);

    auto tic = high_resolution_clock::now();
    g_start = true;

    for (std::thread& t : threads)
    {
        t.join();
    }

    auto toc = high_resolution_clock::now();

    std::cout << "x = " << g_x << " y = " << g_y << " number of times y > x = " << g_observed << std::endl;
    std::cout << "&x = " << (int*)&g_x << " &y = " << (int*)&g_y << std::endl;
    std::chrono::duration<double> t = toc - tic;
    std::cout << "time elapsed = " << t.count() << std::endl;
    std::cout << "USE_ATOMIC = " << USE_ATOMIC << " SEPARATE_CACHE_LINES = " << SEPARATE_CACHE_LINES << std::endl;

    return 0;
}

Example output:

x = 1583672 y = 1583672 number of times y > x = 254
&x = 00007FF62BE95800 &y = 00007FF62BE95804
time elapsed = 0.187785
USE_ATOMIC = 0 SEPARATE_CACHE_LINES = 0

Upvotes: 2

Views: 445

Answers (1)

mevets

Reputation: 10435

False sharing is mainly related to performance, not coherence or program order. The CPU cache works on a granularity, called a cache line, which is typically 16, 32, 64, ... bytes. That means that if two independent data items are close together in memory, they will experience each other's cache operations. Specifically, if (uintptr_t)&a / CACHE_LINE_SIZE == (uintptr_t)&b / CACHE_LINE_SIZE, then a and b share a cache line.
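To make that condition concrete, here is a minimal sketch (the 64-byte line size and the helper name same_cache_line are my assumptions, not anything standard; real hardware may use a different line size):

#include <cstdint>

// Assumption: a 64-byte cache line; query the CPU for the real value.
constexpr std::uintptr_t CACHE_LINE_SIZE = 64;

// Two addresses share a line when they fall into the same line-sized
// block, i.e. when their addresses divided by the line size are equal.
bool same_cache_line(const void *a, const void *b)
{
    return reinterpret_cast<std::uintptr_t>(a) / CACHE_LINE_SIZE
        == reinterpret_cast<std::uintptr_t>(b) / CACHE_LINE_SIZE;
}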

For example, if CPUs 0 and 1 are fighting over a, and CPUs 2 and 3 are fighting over b, the cache line containing a and b will thrash between all four caches. This is the effect of false sharing, and it causes a large performance drop.

False sharing happens because the coherence algorithm in the caches demands that there be a consistent view of memory. A good way to examine it is to put two atomic counters in a structure, spaced a few KiB apart:

struct a {
    long a;
    long pad[1024];
    long b;
};

and find a nice little machine-language function to do an atomic increment. Then cut loose NCPU/2 threads incrementing a and NCPU/2 threads incrementing b until they reach a big number. Then repeat with the pad array commented out, and compare the times.
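For instance, a minimal portable sketch of that experiment using std::atomic in place of hand-written machine language (the struct name, thread count, and iteration count are arbitrary choices of mine):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

struct Counters
{
    std::atomic<long> a{0};
    long pad[1024];            // comment this out to put a and b on one line
    std::atomic<long> b{0};
};

int main()
{
    Counters c;
    // Assume hardware_concurrency() returns a sensible nonzero count.
    const unsigned nthreads = std::thread::hardware_concurrency();
    const long iterations = 10000000;

    auto tic = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nthreads; ++i)
    {
        // Half the threads hammer a, the other half hammer b.
        std::atomic<long> *target = (i % 2 == 0) ? &c.a : &c.b;
        threads.emplace_back([target, iterations]
        {
            for (long j = 0; j < iterations; ++j)
            {
                target->fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread &t : threads)
    {
        t.join();
    }
    auto toc = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> t = toc - tic;
    std::cout << "time elapsed = " << t.count() << std::endl;
    return 0;
}

With the pad in place, the two counters live on different cache lines and the two groups of threads barely interfere; without it, every increment invalidates the other group's copy of the line, and the run is noticeably slower.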

When you are trying to get at machine details, clarity and precision are your friends; C++ and weird attribute declarations aren’t.

Upvotes: 4
