Krokeledocus

Reputation: 139

Missing small primes in C++ atomic prime sieve

I'm trying to develop a concurrent prime sieve using C++ atomics. However, as core_count increases, more and more small primes go missing from the output.

My guess is that the producer threads overwrite each other's results before they are read by the consumer, even though the construction should protect against that by using the magic number 0 to indicate the slot is ready to accept the next prime. It seems compare_exchange_weak is not really atomic in this case.

Things I've tried:

I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.

Any ideas on this are welcome, including some best practices I might have missed.

#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>

int main() {
  using namespace std;

  constexpr memory_order order = memory_order_relaxed;
  atomic<int> output{0};  // single-slot mailbox: 0 means "empty, ready for next prime"
  vector<atomic_bool> sieve(10000);  // false = unclaimed candidate, true = claimed or composite
  for (auto& each : sieve) atomic_init(&each, false);
  atomic<unsigned> finished_worker_count{0};

  auto const worker = [&output, &sieve, &finished_worker_count]() {
    for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
      current = find_if(current, sieve.end(), [](atomic_bool& value) {
        bool untrue = false;
        return value.compare_exchange_strong(untrue, true, order);
      });
      if (current == sieve.end()) break;
      int const claimed = static_cast<int>(distance(sieve.begin(), current));
      int zero = 0;
      // Hand the claimed prime to the consumer; spin until the mailbox is empty.
      while (!output.compare_exchange_weak(zero, claimed, order))
        ;
      for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
           product += claimed)
        sieve[product].store(true, order);  // cross off multiples of the claimed prime
    }
    finished_worker_count.fetch_add(1, order);
  };

  const auto core_count = thread::hardware_concurrency();
  vector<future<void>> futures;
  futures.reserve(core_count);
  generate_n(back_inserter(futures), core_count,
             [&worker]() { return async(worker); });
  vector<int> result;
  while (finished_worker_count < core_count) {
    auto current = output.exchange(0, order);  // take a prime, mark the mailbox empty
    if (current > 0) result.push_back(current);
  }
  sort(result.begin(), result.end());
  for (auto each : result) cout << each << " ";
  cout << '\n';
  return 0;
}

Upvotes: 4

Views: 161

Answers (2)

Peter Cordes

Reputation: 365157

Even with the correctness bugs fixed, I think this approach is going to perform poorly, with multiple threads writing to the same cache lines.


As 1201ProgramAlarm points out in their answer, you need to reset zero = 0 inside the CAS retry loop, but I wouldn't expect good performance! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write the loop as follows, so zero = 0 only appears once in the source, although it still executes before every CAS attempt:

  do {
    zero = 0;
  } while (!output.compare_exchange_weak(zero, claimed, order));

Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element it's passed (as yours does in the find_if). That's almost certainly not a problem in practice in this case: in mainstream implementations, find_if is just written in C++ in a header, and likely in a way that there isn't actually any UB in the resulting program.

And it would be straightforward to replace the find_if with a loop, which would probably make the code more readable, since you could use array indexing the whole time instead of iterators; let the compiler optimize that to a pointer, and then pointer subtraction, if it wants.

Scan read-only until you find a candidate to try to claim, don't try to atomic-RMW every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
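
For example, a claim loop along those lines might look like this. It's just a sketch with made-up names (claim_next), not drop-in code for the question:

#include <atomic>
#include <vector>

// Scan read-only for an unclaimed slot; only attempt the (expensive)
// atomic RMW once a candidate is found. Returns -1 if none remain.
int claim_next(std::vector<std::atomic<bool>>& sieve, int start) {
  for (int i = start; i < static_cast<int>(sieve.size()); ++i) {
    if (sieve[i].load(std::memory_order_relaxed))  // cheap read-only check
      continue;
    bool expected = false;
    // We can still lose the race to another thread; if so, keep scanning.
    if (sieve[i].compare_exchange_strong(expected, true,
                                         std::memory_order_relaxed))
      return i;  // successfully claimed index i
  }
  return -1;
}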

Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.


Multiple threads storing to the array will cause contention

Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line (usually 64 bytes on modern systems) until after it gets exclusive ownership of that line.

Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).

But with other threads writing the array, at best you'll get hits in L3 cache. And since the sieve is small, other threads won't be evicting its lines from their own L1d caches, so an RFO from a core that wants to write will typically find that some other core has the line in Modified state, and the L3 cache (or other tag directory) will have to forward the request to that core to write back or transfer the data directly. (Intel CPUs from Nehalem onward use inclusive L3 cache, where the tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but I think client CPUs still have inclusive L3 cache, where the tags also work as a snoop filter / coherence directory.)

With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)

A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I benchmarked a hand-written x86 asm version that uses a bitmap, in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments, on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing-off at i*i.)
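
In C++ source, those two improvements look roughly like this; a minimal single-threaded sketch of my own, not the asm version from the link:

#include <cstddef>
#include <vector>

// Store odd numbers only (sieve[k] represents 2*k + 1), and start
// crossing off at i*i, since smaller multiples of i were already
// crossed off by smaller primes.
std::vector<char> odd_sieve(std::size_t limit) {
  std::vector<char> sieve(limit / 2, true);
  if (!sieve.empty()) sieve[0] = false;  // 1 is not prime
  for (std::size_t i = 3; i * i < limit; i += 2)
    if (sieve[i / 2])
      for (std::size_t j = i * i; j < limit; j += 2 * i)  // j stays odd
        sieve[j / 2] = false;
  return sieve;  // 2 is prime; odd p >= 3 is prime iff sieve[p / 2]
}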

Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. Unlike with DRAM where the shared bus is the bottleneck, unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active).

You'd have to benchmark to see whether the threads pile up in a bad way or not: whether they tend to end up close together, or whether one with a larger stride can pull ahead and get enough distance that its write-prefetches don't create extra contention.

The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.

(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)

At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)

Better strategy: each thread owns a chunk of the sieve

Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)

You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes, to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps it appends them to an array of primes and updates a valid-index with a release store, so readers can keep reading that array up to that point, then spin-wait or use C++20 .wait() for a value other than the one they last saw. (.wait would need a .notify in the writer, maybe every 10 primes?)
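
That handshake could look something like this, assuming C++20; primes_valid, publish_prime, and the batching choice are all my own invention:

#include <array>
#include <atomic>

std::array<int, 1024> primes;      // written only by the producing thread
std::atomic<int> primes_valid{0};  // how many entries readers may use

void publish_prime(int p) {
  static int count = 0;            // producer-private; single producer only
  primes[count++] = p;             // plain store into the array
  primes_valid.store(count, std::memory_order_release);  // publish
  primes_valid.notify_all();       // C++20; could batch, e.g. every 10 primes
}

// Reader: consume entries [seen, n) as they appear.
void read_new_primes(int& seen) {
  int n = primes_valid.load(std::memory_order_acquire);
  if (n == seen) {
    primes_valid.wait(seen);       // blocks while primes_valid == seen
    n = primes_valid.load(std::memory_order_acquire);
  }
  for (; seen < n; ++seen) { /* use primes[seen] */ }
}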

If you want to move on to a larger sieve, divide up the next chunk of the full array between threads and have each of them cross off the known primes from the first part of the array. No thread has to wait for any other; the first batch of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
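
Crossing off within an owned chunk needs no atomics at all; a sketch (cross_off_chunk and base_primes are hypothetical names):

#include <cstddef>
#include <vector>

// One thread crosses off the known primes within its own chunk
// [begin, end) of the sieve. If chunk boundaries are aligned to cache
// lines (64 bytes), no other thread ever writes these lines, so plain
// non-atomic stores are fine.
void cross_off_chunk(std::vector<char>& sieve, std::size_t begin,
                     std::size_t end,
                     const std::vector<std::size_t>& base_primes) {
  for (std::size_t p : base_primes) {
    std::size_t start = ((begin + p - 1) / p) * p;  // first multiple >= begin
    if (start < p * p) start = p * p;  // smaller multiples already handled
    for (std::size_t m = start; m < end; m += p)
      sieve[m] = false;
  }
}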

Probably you don't want an actual array of atomic_int; probably all threads should just be scanning the sieve to find not-crossed-off positions. Especially if you can do that efficiently with SIMD, or with a bit-scan like tzcnt if you use packed bitmaps for this.
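
With a packed bitmap, that scan can use C++20 std::countr_zero, which compiles to tzcnt on x86-64. A sketch; the set-bit-means-candidate convention and names are mine:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Visit every still-set bit (candidate position) in a packed bitmap.
template <typename F>
void for_each_candidate(const std::vector<std::uint64_t>& words, F&& visit) {
  for (std::size_t w = 0; w < words.size(); ++w) {
    std::uint64_t bits = words[w];
    while (bits) {
      int b = std::countr_zero(bits);  // index of the lowest set bit
      visit(w * 64 + b);               // absolute position in the sieve
      bits &= bits - 1;                // clear that bit and continue
    }
  }
}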

(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)

Upvotes: 1

1201ProgramAlarm

Reputation: 32717

compare_exchange_weak will update (change) the "expected" value (the local variable zero) if the exchange cannot be made. This allows one prime to be overwritten by another if the main thread doesn't handle the first prime quickly enough.

You'll want to reset zero back to zero before rechecking:

while (!output.compare_exchange_weak(zero, claimed, order))
   zero = 0;

Upvotes: 4
