Reputation: 8505
I am using a custom network protocol library. This library is built on TCP/IP and is supposed to be used for high-frequency messaging. It is a non-blocking library and uses callbacks as the interface to integrate with the caller.
I am no performance expert, and that is why I decided to ask this question here. The custom library comes with a particular constraint, outlined below:
"Callee should not invoke any of the library's API under the context of the callback thread. If they attempt to do so, the thread will hang"
The only way to overcome this API restriction is to start another thread which processes the message and invokes the library to send a response. The library thread and the processing thread would share a common queue, which would be protected by a mutex and use wait/notify calls to indicate the presence of a message.
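Concretely, the hand-off I have in mind looks roughly like this (a minimal sketch; Message and the queue are placeholders, not the library's API):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Placeholder message type, not the library's real message structure.
struct Message { std::string payload; };

class MessageQueue {
public:
    // Called from the library's callback thread.
    void push(Message msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();                       // wake the processing thread
    }

    // Called from my processing thread; blocks until a message arrives.
    Message pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        Message msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Message> queue_;
};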
If I am receiving 80k messages per second, then I would be putting threads to sleep and waking them up pretty often, performing thread context switches ~80k times per second.
Plus, as there are two threads, they will not share the message buffer in the L1 cache. The cache line containing the message would first be filled by the library's thread, then evicted and pulled into the processing thread's core's L1 cache. Am I missing something, or is it possible that the library's design is not meant for high-performance use cases?
My questions are:
1. I have seen warnings like "Don't use this API in a callback's context as it can cause deadlocks." across many libraries. What are the common design choices that cause such constraints? They could use recursive locks if it were simply a question of the same thread taking the lock multiple times. Is this a re-entrancy issue, and what challenges might cause an API owner to make a non-re-entrant API?
2. Is there a way, in the above design model, for the library thread and the processing thread to share the same core, and consequently share a cache line?
3. How expensive are volatile sig_atomic_t's as a mechanism to share data between two threads?
4. Given a high frequency scenario, what is a light-weight way to share information between two threads?
The library and my application are built on C++ and Linux.
Upvotes: 12
Views: 2923
Reputation: 3350
An important thing to keep in mind here is that, when working on network applications, the more important performance metric is "latency per task", not the raw CPU-cycle throughput of the entire application. To that end, thread message queues tend to be a very good method for responding to activity in the quickest possible fashion.
80k messages per second on today's server infrastructure (or even my Core i3 laptop) is bordering on insignificant territory -- especially insofar as L1 cache performance is concerned. If the threads are doing a significant amount of work, then it's not unreasonable at all to expect the CPU to flush through the L1 cache every time a message is processed, and if the messages are not doing very much work at all, then it just doesn't matter, because it's probably going to be less than 1% of the CPU load regardless of L1 policy.
At that rate of messaging I would recommend a passive threading model, e.g. one where threads are woken up to handle messages and then fall back asleep. That will give you the best latency-vs-performance trade-off: it's not the most performance-efficient method, but it will be the best at responding quickly to network requests (which is usually what you want to favor when doing network programming).
On today's architectures (2.8 GHz, 4+ cores), I wouldn't even begin to worry about raw performance unless I expected to be handling maybe 1 million queued messages per second. And even then, it would depend a bit on exactly how much Real Work the messages are expected to perform. If they aren't expected to do much more than prep and send some packets, then 1 million is definitely conservative.
Is there a way, in the above design model, for the library thread and the processing thread to share the same core, and consequently share a cache line?
No. I mean, sure there is if you want to roll your own Operating System. But if you want to run in a multitasking environment with the expectation of sharing the CPU with other tasks, then "No." And locking threads to cores is very likely to hurt your threads' average response times without providing much in the way of better performance (any gain would depend on the system being used exclusively for your software, and would probably evaporate on a system running multiple tasks).
Given a high frequency scenario, what is a light-weight way to share information between two threads?
Message queues. :) Seriously. I don't mean to sound silly, but that's what message queues are: they share information between two threads and they're typically light-weight about it. If you want to reduce context switches, only signal the worker to drain the queue after some number of messages have accumulated (or after some timeout period, in case of low activity) -- but be wary that this will increase your program's response time/latency.
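As a rough illustration of that batching (the names, batch size, and timeout are made up, not from any particular library):

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Sketch: the producer only notifies the worker once a batch has accumulated;
// the worker also wakes up on a timeout so low-traffic periods still get
// serviced. Both thresholds trade latency for fewer context switches.
template <typename T>
class BatchedQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(item));
        if (queue_.size() >= batch_size_)
            cv_.notify_one();               // signal only when a batch is ready
    }

    // Drain everything currently queued, or return whatever is there (possibly
    // nothing) once the timeout expires.
    std::queue<T> drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait_for(lock, max_latency_,
                     [this] { return queue_.size() >= batch_size_; });
        std::queue<T> batch;
        batch.swap(queue_);
        return batch;
    }

private:
    const std::size_t batch_size_ = 32;                  // illustrative values
    const std::chrono::milliseconds max_latency_{1};
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};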
Upvotes: 1
Reputation:
How can 2 threads share the same cache line?
Threads have nothing to do with cache lines. At least not explicitly. What can go wrong is cache flush on context switch and TLB invalidation, but given the same virtual address mapping for threads, caches should generally be oblivious to these things.
What are the common design choices that cause such design constraints?
Implementors of the library do not want to deal with re-entrant calls into their API (e.g. a callback such as on_error(), from which you call send() again; that would need special care to be taken on their part).
I personally consider it a very bad thing to have an API designed around callbacks when it comes to high performance and especially network-related things, though sometimes it makes life a lot simpler for both users and developers (in terms of sheer ease of writing the code). The only exception to this might be CPU interrupt handling, but that is a different story, and you can hardly call it an API.
They could use recursive locks if it were simply a question of the same thread taking the lock multiple times.
Recursive mutexes are relatively expensive. People who care about run-time efficiency tend to avoid them where possible.
Is there a way, in the above design model, for the library thread and the processing thread to share the same core, and consequently share a cache line?
Yes. You will have to pin both threads to the same CPU core, for example by using sched_setaffinity(). But this goes beyond a single program; the whole environment must be configured correctly. For example, you may want to consider not allowing the OS to run anything on that core except your two threads (including interrupts), and not allowing those two threads to migrate to a different CPU.
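As a rough sketch (the core number and function name are placeholders), pinning a thread via the pthreads counterpart of sched_setaffinity() could look like this:

#include <pthread.h>   // pthread_setaffinity_np is a GNU extension
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET (needs _GNU_SOURCE,
                       // which g++ defines by default on Linux)
#include <cstdio>

// Pin the calling thread to one core. Call it from both the library thread
// and the processing thread, or pass their pthread_t handles instead of
// pthread_self().
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
}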
How expensive are volatile sig_atomic_t's as a mechanism to share data between two threads?
By itself it is not expensive. In a multi-core environment, however, you may see some cache invalidation, stalls, increased MESI traffic, etc. Given that both threads are on the same core and nothing intrudes, the only penalty is not being able to cache the variable, which is OK since it should not be cached anyway (i.e. the compiler will always fetch it from memory, be that a cache or main memory).
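To make the usage concrete, such a flag might look like the sketch below (illustrative only; note that volatile forces a re-read on every access but gives no ordering guarantees for other shared data, for which std::atomic or explicit barriers would be needed):

#include <csignal>

// A single "message pending" flag shared between the library thread and the
// processing thread.
volatile std::sig_atomic_t message_pending = 0;

void library_thread_callback()      // hypothetical producer side
{
    // ... write the message into a shared buffer ...
    message_pending = 1;            // raise the flag
}

void processing_thread_poll()       // hypothetical consumer side
{
    while (!message_pending) { }    // busy-wait; re-read forced by volatile
    message_pending = 0;            // acknowledge
    // ... read the message from the shared buffer ...
}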
Given a high frequency scenario, what is a light-weight way to share information between two threads?
Read and write from/to the same memory, possibly without any system calls or blocking calls. For example, one can implement ring buffers for two concurrent threads using memory barriers and nothing else, at least on the Intel architecture; you have to pay extreme attention to detail to do that. If, however, something must be explicitly synchronized, then atomic instructions are the next level. Haswell also comes with Transactional Memory that can be used for low-overhead synchronization. After that, nothing is fast.
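For instance, a minimal single-producer/single-consumer ring buffer along these lines can be sketched with C++11 atomics; acquire/release operations compile down to ordinary loads and stores on x86 (the names and capacity are illustrative, not part of any particular library):

#include <atomic>
#include <cstddef>

// SPSC ring buffer: exactly one producer thread calls push(), exactly one
// consumer thread calls pop(). Holds at most N-1 elements.
template <typename T, std::size_t N>
class SpscRing {
public:
    bool push(const T& item) {                            // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                                 // full
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);     // publish the slot
        return true;
    }

    bool pop(T& out) {                                    // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                                 // empty
        out = buffer_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release); // free the slot
        return true;
    }

private:
    T buffer_[N];
    std::atomic<std::size_t> head_{0};                    // written by producer
    std::atomic<std::size_t> tail_{0};                    // written by consumer
};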
Also, take a look at the Intel 64 and IA-32 Architectures Software Developer's Manual, Chapter 11, on memory cache and memory cache control.
Upvotes: 6