Reputation: 19
I'm working on a lock-free stack implementation using C++ and CAS (compare_exchange_strong/compare_exchange_weak). The implementation seems to work fine on modern Linux systems but fails on Windows and some older versions of Linux. Below is the complete code for testing multi-producer, multi-consumer thread safety:
#include <unordered_set>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <string>
#include <vector>
#include <iostream>
#include <atomic>
#include <thread>
#include <chrono>
#include <mutex>

using namespace std;

template <typename T>
class LockFreeStack {
public:
    struct Node {
        T data;
        Node* next;
        Node(const T& value) : data(value), next(nullptr) {}
    };

    LockFreeStack() : head(nullptr) {}

    ~LockFreeStack() {
        while (pop());
    }

    bool empty() {
        return head == nullptr;
    }

    void push(const T& value) {
        Node* new_node = new Node(value);
        auto q = head.load(memory_order_relaxed);
        do {
            new_node->next = q;
        } while (!head.compare_exchange_strong(q, new_node)); // Neither compare_exchange_strong nor compare_exchange_weak can guarantee thread safety.
    }

    Node* pop() {
        auto q = head.load(memory_order_acquire);
        while (q && !head.compare_exchange_strong(q, q->next)) {}
        return q;
    }

private:
    atomic<Node*> head;
};

int RollDice() {
    return rand() % 500 + 1;
}

void test_lock_free_stack() {
    LockFreeStack<int> stack;
    const int num_producers = 15;
    const int num_consumers = 15;
    const int num_elements_per_producer = 100000;
    vector<thread> producers;
    vector<thread> consumers;
    vector<int> producer_counts(num_producers, 0);
    vector<int> consumer_counts(num_consumers, 0);
    unordered_set<int> consumed_set[num_consumers];
    mutex consumed_set_mutex;
    atomic<int> producers_done(0);

    // Producer threads
    for (int i = 0; i < num_producers; ++i) {
        producers.emplace_back([&stack, &producer_counts, i, num_elements_per_producer, &producers_done]() {
            srand(time(NULL));
            this_thread::sleep_for(chrono::milliseconds(RollDice()));
            for (int j = 0; j < num_elements_per_producer; ++j) {
                int value = i * num_elements_per_producer + j;
                stack.push(value);
                producer_counts[i]++;
            }
            producers_done++;
        });
    }

    // Consumer threads
    for (int i = 0; i < num_consumers; ++i) {
        consumers.emplace_back([&stack, &consumer_counts, i, &consumed_set, &consumed_set_mutex, &producers_done, num_producers]() {
            LockFreeStack<int>::Node* value;
            while (producers_done != num_producers || !stack.empty()) {
                value = stack.pop();
                if (value) {
                    {
                        consumed_set[i].insert(value->data);
                    }
                    delete value;
                    consumer_counts[i]++;
                }
            }
        });
    }

    for (auto& producer : producers) {
        producer.join();
    }
    for (auto& consumer : consumers) {
        consumer.join();
    }

    int total_produced = 0;
    for (int count : producer_counts) {
        total_produced += count;
    }
    int total_consumed = 0;
    for (int count : consumer_counts) {
        total_consumed += count;
    }

    if (total_produced != total_consumed) {
        cout << "total_produced != total_consumed\n";
    }
    if (!stack.empty()) {
        cout << "!stack.empty()\n";
    }

    unordered_set<int> total_consumed_set;
    for (int j = 0; j < num_consumers; j++) {
        for (int value : consumed_set[j]) {
            total_consumed_set.insert(value);
        }
    }
    for (int i = 0; i < total_produced; ++i) {
        if (total_consumed_set.find(i) == total_consumed_set.end()) {
            printf("consumed_set.find(%d) == consumed_set.end()\n", i);
        }
    }

    cout << "Test passed. Produced: " << total_produced
         << ", Consumed: " << total_consumed << endl;
}

int main() {
    test_lock_free_stack();
    return 0;
}
Problem: On modern Linux systems, the test passes successfully and the results are as expected. However, on Windows and some older Linux systems, the test fails with issues like:

Questions:

1. Why is it that when using Boost and GCC 7 to cross-compile a statically linked test program (to ensure the same machine code is used), the program works on a newer version of Linux (kernel 4.19) but fails on an older one (kernel 4.4)?

2. Why does the test fail on Windows (7, 10, 11) regardless of whether I use Boost (version 1.73) or std (I've tried compiling with both Clang and VS2022), on both Win32 and x64?
Upvotes: 1
Views: 81
Reputation: 2979
As Nate Eldredge pointed out in his answer there is a race, but it is not the real issue. Yes, it is possible that once a thread has read head, the node gets deleted. However, most memory managers rarely return memory back to the system (if at all), so most likely the address is still valid and can be read from. Since we are reading from a deleted object we can read random garbage, but we do not really do anything with the value - it is only used in the CAS operation, where we try to update head provided it still has the same value. If the CAS fails (which it should if the node has already been removed and deleted), the value is never used - so no harm done. Of course, strictly speaking this is still undefined behavior, but in practice it should still work correctly (if only "by accident").
The real issue is the ABA problem, as @PeterCordes has already pointed out. When a node gets deleted its memory is returned to the memory manager, which can then reuse it for later allocations. Now consider the situation where the stack contains the values A B C. Assume a consumer thread reads head, which points to A, and then reads A's next, which points to B, but before it can perform the CAS it is interrupted. Now another consumer pops A and B and deletes them. Then some producer creates a new node and, since the memory manager can reuse previously returned memory, it happens to end up at the same address as the old A node. When this node is pushed to the stack, the stack then has the values A C. Now the interrupted consumer continues with the CAS - it compares A with A, which obviously succeeds - even though this is a "different" A with a different next pointer - and happily stores the previously read pointer to B in head. But since B has already been removed (and deleted), we have just corrupted the stack.
The reason this seems to work on some systems but fails on others is that they use different memory managers.
There are different ways to avoid the ABA problem. The simplest is to combine the head pointer with a counter and increment the counter with each update. This could be a packed struct with the pointer and the counter, but that would require a double-word CAS operation. Alternatively, the counter could be put into the pointer's top 16 bits, since most OSes only use 48-bit addresses. However, it is important to note that this approach does not fix the previously mentioned race condition, since it is still possible that we attempt to read from a node that has already been deleted - we just would never use such a value, because the CAS is guaranteed to fail. (Strictly speaking it isn't actually "guaranteed" - the counter can wrap around, so theoretically it is still possible to run into the same issue; the number of bits for the counter just has to be big enough that we can consider this practically impossible.)
Generally this problem is called the "memory reclamation problem". You can check out my Master's thesis Effective Memory Reclamation for Lock-Free Data Structures in C++ for a more in depth explanation together with a number of proposed reclamation schemes to solve this problem. You can also check out my Xenium library which is based on my work for the thesis and contains a number of concurrent data structures and reclamation schemes.
Upvotes: 2
Reputation: 58663
I think there's a fundamental race condition in your algorithm for pop. Once you have loaded the value of head into q, there's nothing to prevent some other thread from popping that node and deleting it, and then accessing q->next is undefined behavior.
Consider the following interleaving of code. Suppose for concreteness that head initially contains the address 0xdeadbeef. Thread A is a consumer which enters pop(), and thread B is another consumer.

    Thread A                                  Thread B (consumer)

    q = head.load(memory_order_acquire);
    // now q = 0xdeadbeef
                                              value = stack.pop(); // value = 0xdeadbeef
                                              if (value) {         // true
                                                  consumed_set[i].insert(value->data);
                                                  delete value;    // memory at 0xdeadbeef is now invalid
                                              }
    while (q && !head.compare_exchange_strong(q, q->next)) {}
In thread A, q still equals 0xdeadbeef, which is not null. But evaluating q->next attempts to access memory at 0xdeadbeef, which is no longer valid, and the result is undefined behavior.
I don't know why it would be more likely to fail on some systems than others, but that's certainly not uncommon with undefined behavior. For instance, some systems could be more prone to quickly reusing or overwriting deallocated memory.
Unfortunately, I don't see an easy way to fix this. Lock-free data structures are very hard to get right. You might need to do more research into proven correct algorithms.
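For comparison, here is a minimal sketch of a safe (but not lock-free) baseline: a mutex-protected stack that pops by value, so no caller ever holds a raw pointer into memory that another thread may free. The class name is illustrative, not from the question:

```cpp
#include <mutex>
#include <optional>
#include <stack>
#include <utility>

template <typename T>
class LockedStack {
    std::stack<T> items;
    std::mutex m;

public:
    void push(const T& v) {
        std::lock_guard<std::mutex> lock(m);
        items.push(v);
    }

    // Returns the popped value, or std::nullopt if the stack was empty.
    // Copying the value out under the lock avoids the use-after-free
    // entirely, at the cost of serializing all operations.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(m);
        if (items.empty()) return std::nullopt;
        T v = std::move(items.top());
        items.pop();
        return v;
    }
};
```

This trades scalability for correctness, which is often the right starting point before reaching for a proven lock-free algorithm.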
Another tip is to use ThreadSanitizer on your test. It detects a data race on delete, which is likely this one (I didn't inspect the backtrace carefully).
Upvotes: 1