Reputation: 4955
Linux 4.18 introduced the rseq(2) syscall. I found only one question on SO that mentions rseq at all, and there is relatively little information about it online, so I decided to ask. What are restartable sequences, and how can they be used by programmers?
I had to search for "restartable sequences: fast user-space percpu critical sections" to get any meaningful results. I was able to find the commit that added the relevant functionality to the kernel. Further research led me to the 2013 presentation that I believe to be the first introduction of the concept. Much work has been done by a team at a company called EfficiOS, and they describe their intentions behind contributing this feature to the Linux kernel.
It looks like this feature is little known, but apparently it is used to optimize performance in the TCMalloc allocator. Generally, it seems to be some kind of concurrency optimization.
Although the listed sources provide background information, there is yet to be an explanation of RSEQs on SO. It would be useful to hear where else and how they are used in practice.
Suppose I am creating a C++ job system. A part of it is a lock-free multi-producer single-consumer queue. How could I introduce the use of the rseq(2) syscall into my code to potentially improve its performance?
#include &lt;atomic&gt;
#include &lt;concepts&gt;
#include &lt;cstddef&gt;
#include &lt;type_traits&gt;

class mpsc_list_node
{
    mpsc_list_node* _next;

    template&lt;typename T&gt;
        requires std::derived_from&lt;T, mpsc_list_node&gt;
    friend class mpsc_list;
};

template&lt;typename T&gt;
    requires std::derived_from&lt;T, mpsc_list_node&gt;
class mpsc_list
{
private:
    std::atomic&lt;T*&gt; head{ nullptr };

    // Once the consumer has drained the list, head holds this sentinel so
    // that late producers can tell that enqueueing is no longer possible.
    static constexpr size_t COMPLETED_SENTINEL = 42;

public:
    mpsc_list() noexcept = default;

    mpsc_list(mpsc_list&amp;&amp; other) noexcept :
        head{ other.head.exchange(reinterpret_cast&lt;T*&gt;(COMPLETED_SENTINEL),
                                  std::memory_order_relaxed) }
    {
    }

    // Producers: prepend a node with a CAS loop. Fails if the consumer
    // has already completed the list.
    bool try_enqueue(T&amp; to_append)
    {
        T* old_head = head.load(std::memory_order_relaxed);
        do
        {
            if (reinterpret_cast&lt;size_t&gt;(old_head) == COMPLETED_SENTINEL)
                [[unlikely]]
            {
                return false;
            }
            to_append._next = old_head;
        } while (!head.compare_exchange_weak(old_head, &amp;to_append,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
        return true;
    }

    // Consumer: atomically seal the list with the sentinel, then invoke
    // func on every node that was enqueued.
    template&lt;typename Func&gt;
    void complete_and_iterate(Func&amp;&amp; func) noexcept(std::is_nothrow_invocable_v&lt;Func, T&amp;&gt;)
    {
        T* p = head.exchange(reinterpret_cast&lt;T*&gt;(COMPLETED_SENTINEL),
                             std::memory_order_acquire);
        while (p)
            [[likely]]
        {
            T* cur = p;
            T* next = static_cast&lt;T*&gt;(p-&gt;_next);
            p = next;
            func(*cur);
        }
    }
};
The purpose of this mpsc_list and its place in the job system is well explained by my README:
Inter-job synchronization
The only synchronization primitive used is an atomic counter. This idea stems from the famous GDC talk on the implementation of a job system using fibers in Naughty Dog's game engine.
The common promise type of the jobs (promise_base) is actually a type derived from mpsc_list, a multi-producer single-consumer list. This list stores the jobs dependent on the current job. It is a lock-less linked list implemented using atomic operations. Each node stores a pointer to the dependent's promise and the next node. Interestingly, this linked list does not use any dynamic memory allocation.

When a job co_awaits a (possibly 1-sized) set of dependency jobs, it does a few things. Firstly, its promise sets its own internal atomic counter to the number of dependency jobs. Then, it allocates (on the stack) a dependency-count-sized array of notifier objects. The notifier type is the type of the linked list's node. The created notifiers all point to the job being suspended. They do not have a next node.

Then, the job goes through each of its dependency jobs and tries to append the corresponding notifier to the dependency's list. This operation fails if that dependency has already completed, because when a job returns, it sets its list's head to a special sentinel value. If the dependency has already completed (for example, on a different thread), then the suspending job simply decrements its own atomic counter. If the dependency hasn't completed yet, it appends the notifier to that dependency's list. It does so using a CAS loop.

After having gone through each of the dependencies, the suspending job checks how many of its dependencies have already completed. If all of them have, then it doesn't suspend and immediately continues execution. This isn't just an optimization; it is necessary for the job system to function properly, because this job system does not have a suspended-job queue, only a ready-job queue. Suspended jobs are stored only inside their dependencies' linked lists. Hence, if a job suspended while not having any dependencies, it would never get resumed.
When a job returns, it traverses its linked list of dependents. Firstly, it sets the list's head to the special sentinel value. Then, it goes through all of the jobs, atomically decrementing their atomic counters. The decrement is an RMW operation, so the job reads the counter's previous value. If it is one, then it knows that it is the last dependency to complete for that job, and it pushes it to the job queue.
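To make the two paths above concrete, here is a simplified sketch in terms of the mpsc_list class from my code. The names notifier, job_promise, try_suspend, on_job_finished, and ready_queue_push are illustrative placeholders, and the extra self-count in try_suspend is one common way to avoid the resume-before-suspend race; the real promise machinery may handle that differently.

#include &lt;span&gt;

struct job_promise;

// The linked list's node type: it points at the suspended dependent job.
struct notifier : mpsc_list_node
{
    job_promise* dependent;
};

struct job_promise : mpsc_list&lt;notifier&gt;
{
    std::atomic&lt;size_t&gt; remaining_dependencies{ 0 };
};

inline void ready_queue_push(job_promise&amp;)
{
    // Hand the job to the scheduler's ready queue (elided).
}

// Suspend path: subscribe a notifier to each dependency's list. Returns
// false if the job must not suspend because every dependency has already
// completed (nothing would ever resume it). The counter holds one extra
// count for the suspending job itself, so a dependency finishing
// concurrently cannot resume the job before this function returns.
inline bool try_suspend(job_promise&amp; self,
                        std::span&lt;job_promise* const&gt; dependencies,
                        std::span&lt;notifier&gt; notifiers) // stack-allocated by caller
{
    self.remaining_dependencies.store(dependencies.size() + 1,
                                      std::memory_order_relaxed);
    size_t already_completed = 0;
    for (size_t i = 0; i &lt; dependencies.size(); ++i)
    {
        notifiers[i].dependent = &amp;self;
        // try_enqueue fails iff the dependency already returned and sealed
        // its list with the sentinel.
        if (!dependencies[i]-&gt;try_enqueue(notifiers[i]))
            ++already_completed;
    }
    // Drop the counts for already-completed dependencies plus our own
    // extra count; fetch_sub returns the previous value.
    return self.remaining_dependencies.fetch_sub(already_completed + 1,
               std::memory_order_acq_rel) != already_completed + 1;
}

// Completion path: seal the dependents list, then decrement each
// dependent's counter, scheduling those for which we were the last
// outstanding dependency.
inline void on_job_finished(job_promise&amp; finished)
{
    finished.complete_and_iterate([](notifier&amp; n)
    {
        if (n.dependent-&gt;remaining_dependencies.fetch_sub(
                1, std::memory_order_acq_rel) == 1)
            ready_queue_push(*n.dependent);
    });
}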
Upvotes: 12
Views: 2191
Reputation: 4111
Let's understand why the TCMalloc example is a good use case for rseq. First we need to understand the idea behind that aspect of TCMalloc. TCMalloc keeps some data structure of recently freed memory blocks (the exact data structure doesn't actually matter for the discussion). TCMalloc does so because adding memory blocks to that data structure is fast, and pulling a block off that structure when an identical-sized block is needed is also fast. That first stage of deallocation handling intentionally does not perform consolidation of free blocks or any other more complex management tasks.
If you only had a single structure of "recently freed blocks" in your program, you would need to make sure that multiple threads accessing this structure do not cause data corruption. There are three main ways to approach it:

1. protect the structure with a mutex,
2. use a lock-free data structure, or
3. give every thread its own private structure (thread-local data).
The first idea is likely not that great, because acquiring and releasing mutexes is quite heavy, which you don't want on the fast path of malloc/free. The second idea might fly, but lock-free data structures are usually more complex and slower to operate on than plain data structures, so they are not the first choice for the fast path. The third variant, on the other hand, is a quite good choice, as it means that every thread can use fast, unsynchronized access to its private data structure of recently freed blocks; this approach is quite common in allocators for multi-threaded programs.
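To illustrate the third variant (purely a sketch; the names and the choice of a vector are mine, not TCMalloc's actual internals):

#include &lt;cstddef&gt;
#include &lt;vector&gt;

constexpr std::size_t kNumSizeClasses = 64; // illustrative size-class count

// One cache of recently freed blocks per size class, private to each
// thread: only the owning thread ever touches it, so no locks or atomics
// are needed on this path.
thread_local std::vector&lt;void*&gt; freed_blocks[kNumSizeClasses];

void* fast_alloc(std::size_t size_class)
{
    auto&amp; bucket = freed_blocks[size_class];
    if (bucket.empty())
        return nullptr; // fall back to the allocator's slow path (elided)
    void* block = bucket.back();
    bucket.pop_back();  // plain, unsynchronized operation
    return block;
}

void fast_free(void* block, std::size_t size_class)
{
    freed_blocks[size_class].push_back(block);
}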
On the other hand, this means that the data structure has to be allocated per thread (in thread-local storage), has to be initialized on every thread creation, and free blocks from the thread-local "free block pool" have to be merged on thread finalization. This carries a lot of per-thread overhead, which can matter if your task creates a lot of short-lived threads.
The key observation at this point is that the only real requirement is that every thread that executes in parallel needs a different copy of the data structure. We separate the free-block structures not to actually associate them with a specific thread, but just to make sure that multiple currently executing threads do not stomp on each other's feet. If there is a way we can add a block to a free-block structure without being afraid of another thread manipulating the same structure at the same time, the system works - even if the same structure is later read by a different thread and the block gets re-used in that different thread.
This leads to the idea of using per-CPU data instead of per-thread data. Assuming we had a way to prevent thread preemption for a certain amount of time, we could just "disable preemption", then update the free-block structure, and finally "re-enable preemption" when we are done. This enables us to use a data block that is specific to the CPU we are currently running on, and be sure that no other thread is stomping on the same data at the same time. This means we can initialize the structures once for the number of available CPUs (cores) at start-up and keep those structures alive for the whole lifetime of the process, even as threads come and go. Furthermore, it means a block freed on one CPU will likely get re-allocated on the same CPU, which might still have that block in cache, which is a performance gain. This can even happen if the thread on that CPU was switched to a different one!
Now rseq is used as a way to emulate disabling preemption. It works in such a way that a final write to memory at the end of the "restartable sequence" is the key that makes the changes "visible" - like creating a new linked-list node which refers to the current list head as its next node (which does not interfere with the list as it is now), and then updating the head pointer to the new node (the "final write"). You can set up rseq in such a way that if your list-insertion sequence is interrupted (and a different thread or a signal handler has been executed on the same CPU), you restart it from the beginning; only if the sequence is not interrupted (as if preemption were disabled) do you reach the final write where the sequence ends.
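In code, the shape of such a per-CPU free-list push looks roughly like this. Note the heavy hedge: a real rseq critical section must be written in inline assembly with its abort handler registered through struct rseq (that is what the librseq project wraps); here a compare-exchange stands in for the kernel's abort-on-preemption so that the skeleton compiles and the structure is visible:

#include &lt;atomic&gt;
#include &lt;sched.h&gt; // sched_getcpu()

struct free_node { free_node* next; };

constexpr int kMaxCpus = 256;                     // illustrative upper bound
std::atomic&lt;free_node*&gt; per_cpu_head[kMaxCpus]{}; // one free list per CPU

void percpu_free(free_node* block)
{
    for (;;)
    {
        // 1. Find out which CPU we are on. Real rseq code reads the cpu_id
        //    field of the registered struct rseq instead; sched_getcpu()
        //    may already be stale here, which is exactly the race the
        //    kernel's abort-on-preemption closes.
        int cpu = sched_getcpu();

        // 2. Prepare: point the new node at the current head. This does
        //    not modify the list yet, so being preempted here is harmless.
        free_node* old_head = per_cpu_head[cpu].load(std::memory_order_relaxed);
        block-&gt;next = old_head;

        // 3. Commit. With real rseq this is a single plain store inside an
        //    asm block; the kernel branches to an abort label instead if
        //    the thread was preempted or migrated since the sequence began.
        //    The CAS below stands in for that detection in this sketch.
        if (per_cpu_head[cpu].compare_exchange_weak(old_head, block,
                std::memory_order_release, std::memory_order_relaxed))
            return;

        // Aborted (or raced): restart the whole sequence from step 1.
    }
}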
So, the key point is that rseq is used (in this application example) to avoid memory sharing between different CPUs, because such sharing is slow and, without synchronization, will lead to data corruption. The point of your single-consumer queue, on the other hand, is that a single CPU consumes jobs created by other CPUs, so you need to enable communication between different CPUs. I fail to see how restartable sequences are going to help you with that. As I understand it, your lock-free implementation of that queue should do quite well.
I suggest you only think about using rseq as a tool if you have a problem in which thread-local data would be a viable solution as well.
Upvotes: 2
Reputation: 421
The rseq syscall was introduced to allow for restartable sequences, which were invented to allow for per-CPU variables in userspace. Per-CPU variables are easy in kernel-space, as you can disable preemption at any time you reference a per-CPU variable.

In userspace you have to account for unwanted preemption if you want to handle your per-CPU variables consistently with respect to the current thread/CPU. For that, the restartable sequences mechanism was introduced.
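To make that concrete: since glibc 2.35 the C library registers a struct rseq area for every thread on its own, so you can read the kernel-maintained current-CPU field without any setup code. This sketch assumes a recent glibc and a compiler that provides __builtin_thread_pointer():

#include &lt;cstdio&gt;
#include &lt;sys/rseq.h&gt; // struct rseq, __rseq_offset, __rseq_size (glibc &gt;= 2.35)

int main()
{
    if (__rseq_size == 0)
    {
        std::puts("glibc did not register rseq on this system");
        return 1;
    }
    // glibc documents that the thread's rseq area lives at a fixed offset
    // from the thread pointer; the kernel keeps cpu_id up to date even as
    // the thread migrates between CPUs.
    auto* rs = reinterpret_cast&lt;struct rseq*&gt;(
        reinterpret_cast&lt;char*&gt;(__builtin_thread_pointer()) + __rseq_offset);
    std::printf("currently running on CPU %u\n", rs-&gt;cpu_id);
}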
Upvotes: 1