rampatowl

Reputation: 1772

Comparing latency of mutex vs. spinlock wake time

I am curious about the latency of a mutex vs. a spinlock, measured as the time from when one thread unlocks it to when another waiting thread can acquire it. I wrote two tests in C++ and see ~15000 nanoseconds for the mutex and ~300 nanoseconds for the spinlock. I am new to multithreading synchronization, so I would first like to verify that these tests are implemented correctly and that the results are reasonable.

I am also curious why the spinlock still shows a latency of ~1000 clock cycles, even though the atomic operations alone are fast (~100 clock cycles) and the spinning CPU should be able to pick up on the unlock as soon as it occurs. Are there any ways to speed this up?

The results are fairly consistent, and I checked that the time measurement itself has latency <60ns.
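
(One way to check the timer itself, as a minimal sketch: time back-to-back clock calls and keep the smallest delta.)

#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
  using clk = std::chrono::high_resolution_clock;
  int64_t best = INT64_MAX;
  for (int i = 0; i < 100000; ++i) {
    auto a = clk::now();
    auto b = clk::now();
    auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
    if (d < best) best = d;  // keep the smallest observed clock-call latency
  }
  std::cout << best << " ns minimum clock-call latency\n";
}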

Mutex test

// contended_mutex_overhead.cpp

#include <cstdint>
#include <iostream>
#include <chrono>
#include <mutex>
#include <thread>

uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

// Global clock
auto t = std::chrono::high_resolution_clock::now();
uint64_t cycles_count;

// Global mutex lock
std::mutex m;

int slowthread() {
  m.lock();

  // Do work that takes some time, so that main thread is waiting on lock
  int sum = 0;
  for (int i=0; i<1000000000; ++i) {
    sum += i;
  }

  // Mark start time, right before lock is returned
  t = std::chrono::high_resolution_clock::now();
  cycles_count = rdtsc();
  m.unlock();
  return sum;
}

int main() {
  std::thread thr(slowthread);

  // Delay the main() thread with a std::cout call to make sure slowthread() gets a chance
  // to grab the lock before main() requests it in the next line.
  std::cout << "hi\n";

  // The lock will not be available at this time, until slowthread's blocking work completes
  m.lock();

  // Measure the end time, and calculate the delta from slowthread's release of the lock to when lock is given to main()
  auto finish = std::chrono::high_resolution_clock::now();

  auto cycles2 = rdtsc();
  auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(finish-t).count();
  std::cout << d << " nanosecond overhead for contended mutex wake\n";
  std::cout << cycles2 - cycles_count << " cycle overhead for contended mutex wake\n";
  thr.join();
}

Output:

> g++ -O3 -pthread contended_mutex_overhead.cpp; ./a.out
hi
14763 nanosecond overhead for contended mutex wake
42786 cycle overhead for contended mutex wake

Spinlock test

// contended_spinlock_overhead.cpp
#include <cstdint>
#include <iostream>
#include <chrono>
#include <mutex>
#include <thread>
#include <atomic>

uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

struct spinlock {
  std::atomic<bool> lock_ = {false};

  void lock() {
    while(lock_.exchange(true));
  }

  void unlock() { lock_.store(false); }
};

// Global clock
auto t = std::chrono::high_resolution_clock::now();
uint64_t cycles_count;

// Global spinlock
spinlock spinner;

int slowthread() {
  spinner.lock();

  // Do work that takes some time, so that main thread is waiting on lock
  int sum = 0;
  for (int i=0; i<1000000000; ++i) {
    sum += i;
  }

  // Mark start time, right before lock is returned
  t = std::chrono::high_resolution_clock::now();
  cycles_count = rdtsc();
  spinner.unlock();
  return sum;
}

int main() {
  std::thread thr(slowthread);

  // Delay the main() thread with a std::cout call to make sure slowthread() gets a chance
  // to grab the lock before main() requests it.
  std::cout << "hi\n";

  // The lock will not be available at this time, until slowthread's blocking work completes
  spinner.lock();

  // Measure the end time, and calculate the delta from slowthread's release of the lock to when lock is given to main()
  auto finish = std::chrono::high_resolution_clock::now();

  auto cycles2 = rdtsc();
  auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(finish-t).count();
  std::cout << d << " nanosecond overhead for spinlock wake\n";
  std::cout << cycles2 - cycles_count << " cycle overhead for spinlock wake\n";
  thr.join();
}

Output:

hi
363 nanosecond overhead for spinlock wake
1148 cycle overhead for spinlock wake

Upvotes: 3

Views: 1716

Answers (1)

Arty

Reputation: 16737

Both the spinlock and the mutex are MUCH faster than your numbers suggest, if the measurements are done correctly.

The code below shows 75 ns for the mutex and 12.5 ns for the spinlock on my old 2 GHz Windows laptop, and on modern 3 GHz GodBolt online Linux servers about 15 ns for the mutex and 8 ns for the spinlock on average. The console output of the measurements is shown after the code.

The GodBolt Linux mutex is much faster because the Linux implementation is considerably more efficient than the Windows one. On Windows you may try std::shared_mutex instead of std::mutex: the shared variant was added in a much later version of Windows and is built on a faster algorithm and API. On my laptop std::shared_mutex shows 50 ns, while the regular mutex shows 75 ns.
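
For illustration, a minimal sketch of dropping std::shared_mutex in as a plain exclusive lock (only the exclusive lock()/unlock() side is used, so it behaves like std::mutex here):

#include <mutex>
#include <shared_mutex>

std::shared_mutex m;

void critical_section() {
    // lock_guard takes the exclusive lock of shared_mutex, same semantics as std::mutex
    std::lock_guard<std::shared_mutex> lock(m);
    // ... protected work ...
}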

I also measure cycles and GHz, shown in the console output. Note that because cycles are measured with the RDTSC instruction, the cycle count corresponds to the base frequency; if the CPU is currently thermal-throttling or turbo-boosting, i.e. running at a speed other than base, the cycle count will be off. Only the nanosecond (ns) figures are always correct, since they reflect the current CPU speed.

How to measure everything correctly: first, to measure ONLY the time of the spinlock and mutex themselves, without any extra work, you should measure a single thread.

But to measure a single thread, you still need to show the compiler that the mutex and spinlock are going to be used from other threads. For that I create a second dummy thread through std::async and keep its future in a volatile variable. This dummy thread prevents the compiler from optimizing the mutex and spinlock code away, since it sees them being used from another thread. Without the dummy thread, the locking code might be removed from the single (main) thread entirely.

The dummy thread does essentially nothing: it locks the mutex or spinlock just once and quits, so it does not disturb the measurements.

One more important point concerns std::atomic_flag: you should use a more relaxed memory order. Specifically, in my code I use std::memory_order_acquire in the locking line of the spinlock and std::memory_order_release in the unlocking line. This weaker ordering gets more speed out of the spinlock by removing unnecessary operations; by default the much slower sequentially consistent order std::memory_order_seq_cst is used.
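
Applied to the spinlock struct from the question, the acquire/release version looks like this (a sketch using std::atomic_flag, matching what my code below does inline):

#include <atomic>

struct spinlock {
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

  void lock() {
    // acquire pairs with the release in unlock(); no full seq_cst barrier is needed
    while (flag_.test_and_set(std::memory_order_acquire)) {}
  }

  void unlock() {
    // release publishes all writes made inside the critical section
    flag_.clear(std::memory_order_release);
  }
};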

The next very important point, and probably the main reason for the wrong results in your code: you should take MANY measurements and keep the smallest one. Why the smallest? If you measure for a full second, the operating system (Windows/Linux/macOS) will re-schedule all the threads many times during that period, roughly once every 10-15 milliseconds. Each re-schedule introduces a huge pause between thread wake-ups and spoils the results.

To avoid re-scheduling spoiling the timings, we do two things: first, we measure over a fairly small interval, around 50-100 microseconds; second, we repeat this many times (10-15 times). On any operating system it is then almost guaranteed that at least one 50-microsecond measurement is clean, i.e. does not include a thread re-schedule.

A 100-microsecond interval is long enough for the precision of std::chrono::high_resolution_clock, which usually has a resolution of about 200-500 nanoseconds.

Taking the minimum of the 15 measurements also filters out other hardware delays: the minimum shows the best-case result, which is close to what you will usually see on average in a real system.
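
The overall pattern, extracted from the measurement loops below (MeasureOnce is a hypothetical stand-in for one short, ~50-100 microsecond timed loop returning its duration in nanoseconds):

#include <algorithm>
#include <cstdint>

// Run one short timed loop n times and keep the minimum duration:
// the cleanest run, i.e. the one least disturbed by OS re-scheduling.
template <typename F>
uint64_t BestOfN(int n, F MeasureOnce) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < n; ++i)
        best = std::min(best, MeasureOnce());
    return best;
}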

I also use RDTSC to measure cycles, purely so the console output can show cycles and the CPU's GHz alongside the nanoseconds.

A small further improvement comes from repeating the lock/unlock of both the mutex and the spinlock twice inside the main measurement loop. This is manual loop unrolling: it makes the measurement slightly more precise by spending fewer cycles on loop-variable increments and comparisons.

Try it online on GodBolt!

#include <cstdint>
#include <chrono>
#include <atomic>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>

#include <immintrin.h>

int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    size_t constexpr nloops = 1 << 4, ntests = 1 << 13;
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        auto volatile ares = std::async(std::launch::async, [&]{ while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            for (size_t i = 0; i < ntests / 2; ++i) {
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Spinlock time: " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        auto volatile ares = std::async(std::launch::async, [&]{ std::lock_guard<std::mutex> lock(mux); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            for (size_t i = 0; i < ntests / 2; ++i) {
                { std::lock_guard<std::mutex> lock(mux); }
                { std::lock_guard<std::mutex> lock(mux); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Mutex time   : " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
}

Output on a modern 3 GHz GodBolt Linux server:

Spinlock time: 7.65 ns (22.95 cycles, 3.00 GHz)
Mutex time   : 15.59 ns (46.78 cycles, 3.00 GHz)

Output on my old 2 GHz Windows laptop:

Spinlock time: 12.38 ns (25.94 cycles, 2.10 GHz)
Mutex time   : 74.27 ns (155.58 cycles, 2.09 GHz)

(std::shared_mutex time on the laptop above is 50 ns.)


The program can be made more elaborate by running multiple threads, to measure how the timings behave when several threads race for the same atomic flag or mutex.

The following program tests both single-threaded and multi-threaded timings. It also increments a counter as a stand-in for spinlock/mutex-protected work.

To get precise results, the program sets thread affinity, pinning different threads to different cores.

The multi-threaded version uses std::barrier so all threads start and stop their measurements at the same precise points in time. It also uses atomic counters to synchronize the threads' pace so they advance at the same rate.

The multi-threaded timings should be almost equal to the single-threaded ones, especially on modern CPUs. They can be MUCH bigger only if the cores chosen by affinity are heavily occupied by the operating system, or if the OS re-schedules the threads in some odd order. It happened to me that the Windows multi-threaded timings were large while the Linux timings were perfectly fine.

See the timings after the code. Note how tiny the GodBolt servers' timings are. The spinlock (both single- and multi-threaded) takes 8 ns (23 cycles), which matches the cost of the atomic flag's main instruction, XCHG (17 cycles here). The mutex takes 16 ns (50 cycles), almost the same as the spinlock, because on Linux the uncontended mutex path is also just an atomic operation in user space (futex-based); the extra overhead comes from the library CALL, with a syscall taken only under contention.

Try it online!

#include <cstdint>
#include <chrono>
#include <algorithm>
#include <atomic>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>
#include <barrier>
#include <optional>
#include <thread>
#include <vector>

#include <immintrin.h>

#if defined(_MSC_VER) && !defined(__clang__)
    #define FINL [[msvc::forceinline]]
#else
    #define FINL __attribute__((always_inline))
#endif

#ifdef _WIN32
    #include <windows.h>
    // https://stackoverflow.com/a/56486809/941531
    inline void SetAffinity(std::thread & thr, size_t i) {
        if (SetThreadAffinityMask(thr.native_handle(), DWORD_PTR(1) << i) == 0)
            std::cout << "SetThreadAffinityMask failed, GLE = " << GetLastError() << std::endl;
    }
#else
    #include <pthread.h>
    // https://stackoverflow.com/a/57620568/941531
    inline void SetAffinity(std::thread & thr, size_t i) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        int rc = pthread_setaffinity_np(thr.native_handle(), sizeof(cpu_set_t), &cpuset);
        if (rc != 0)
            std::cout << "Error calling pthread_setaffinity_np: " << rc << std::endl;
    }
#endif

int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    
    auto RunST = [&Rdtsc, &Time](auto const & ReLock, auto nloops, auto ntests, auto & min_time, auto & min_rdtsc){
        { auto volatile ares = std::async(std::launch::async, [&]{ ReLock(); }); }
        for (size_t j = 0; j < nloops; ++j) {
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 4; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };
    
    auto RunMT = [&Rdtsc, &Time](auto nthr, auto ithr, auto const & ReLock, auto nloops, auto ntests, auto & barrier, auto & timings, auto & min_time, auto & min_rdtsc, auto & brs){
        for (size_t j = 0; j < nloops; ++j) {
            barrier.arrive_and_wait();
            size_t nibr = (ithr + 1) % nthr;
            uint32_t ibrc = 0;
            auto ReSync = [&]{
                ibrc += 16;
                brs[ithr].store(ibrc, std::memory_order_relaxed);
                while (ibrc > brs[nibr].load(std::memory_order_relaxed)) {}
            };
            ReSync();
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 16; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReSync();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            timings[ithr] = {tim, timr};
            barrier.arrive_and_wait();
            if (ithr != 0)
                continue;
            
            std::sort(timings.begin(), timings.end());
            if (timings.back().first >= timings.front().first * 1.3)
                continue;
            
            tim = 0; timr = 0;
            for (size_t i = 0; i < timings.size(); ++i) {
                tim += timings[i].first;
                timr += timings[i].second;
            }
            tim /= timings.size();
            timr /= timings.size();
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };

    auto const hard_threads = std::thread::hardware_concurrency();
    auto const test_threads = 2; //std::max<size_t>(2, hard_threads / 2);
    std::cout << test_threads << " threads." << std::endl;
    
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
    
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Mutex    time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Mutex    time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
}

Output on GodBolt Linux servers:

2 threads.
Single-Threaded Spinlock time:   8.63 ns ( 25.91 cycles, 3.00 GHz), test time 0.31 sec, precision 1.00000000
Multi -Threaded Spinlock time:   8.51 ns ( 25.58 cycles, 3.00 GHz), test time 1.04 sec, precision 1.00000000
Single-Threaded Mutex    time:  15.89 ns ( 47.66 cycles, 3.00 GHz), test time 0.82 sec, precision 1.00000000
Multi -Threaded Mutex    time:  17.94 ns ( 53.86 cycles, 3.00 GHz), test time 4.18 sec, precision 1.00000000

Output on old Windows laptop:

2 threads.
Single-Threaded Spinlock time:  31.40 ns ( 67.22 cycles, 2.14 GHz), test time 0.48 sec, precision 1.00000000
Multi -Threaded Spinlock time:  32.45 ns ( 70.21 cycles, 2.16 GHz), test time 0.98 sec, precision 1.00000000
Single-Threaded Mutex    time:  79.69 ns (168.36 cycles, 2.11 GHz), test time 0.94 sec, precision 1.00000000
Multi -Threaded Mutex    time:  83.67 ns (269.23 cycles, 3.22 GHz), test time 0.91 sec, precision 1.00000000

Upvotes: 3
