Reputation: 23
Sequential loop, critical section, atomic update.
I have a 2D graph consisting of vertices and edges.
For each vertex, I want to compute the number of degrees of freedom (dof). A vertex dof = the number of edges connected to the vertex.
What I do is:
std::vector<size_t> dofs(graph.getNumVertices(), 0);
#pragma omp parallel for
for (int edgeInd = 0; edgeInd < static_cast<int>(graph.getNumEdges()); ++edgeInd)
{
    const auto& edge = graph.getEdge(edgeInd);
    // Both increments are atomic because two threads may touch the same vertex.
    #pragma omp atomic
    ++dofs[edge.v1.index];
    #pragma omp atomic
    ++dofs[edge.v2.index];
}
Which works as expected.
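For reference, by the critical section mentioned in Question 1 below I mean replacing the two atomics with something like this sketch (same graph interface as above):

std::vector<size_t> dofs(graph.getNumVertices(), 0);
#pragma omp parallel for
for (int edgeInd = 0; edgeInd < static_cast<int>(graph.getNumEdges()); ++edgeInd)
{
    #pragma omp critical
    {
        // Only one thread at a time can execute this block, no matter
        // which vertices the current edge touches.
        ++dofs[graph.getEdge(edgeInd).v1.index];
        ++dofs[graph.getEdge(edgeInd).v2.index];
    }
}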
My questions:
All edges are distributed among different threads. Let's assume there are only two threads: 1 and 2.
An edge can have: [...]
Question 1: Does the atomic update "activate itself" only when needed, whereas the critical section (if applied instead) is active all the time?
Meaning, is the following table correct?
An edge can have:    ATOMIC    CRITICAL
[...]
Question 2:
Theoretically, I expect: [...]
Correct?
Upvotes: 0
Views: 185
Reputation: 50836
Question 1: Does atomic update "activate itself" only when needed? And is the critical section (if applied instead) active all the time?
Well, an atomic operation is always atomic, and because of that you pay a significant additional cost just because of a *potential* access by another thread, even when no such access actually happens. That being said, the cost of an atomic operation depends on the number of threads accessing the same atomic variable (actually, the same cache line). Shared accesses are significantly slower.
In practice, an atomic access first gets a cache line from the cache and locks it. If the cache line has previously been locked by another thread on another core, it will likely have to be fetched from the LLC (shared last-level cache), with a pretty big latency (e.g. several dozens of cycles). Once the atomic operation is done, the lock is released. If multiple atomic variables are stored on the same cache line, atomic operations on them are serialized; this issue is called false sharing of atomic variables. When multiple cores access the same atomic variable concurrently in a loop, the cache line needs to ping-pong between cores, which is pretty slow, not to mention the additional related costs that make this even slower. This is called contention on atomic variables. In such a case there is no parallelism, so everything is serialized. On top of that, the contention overhead is so big that it makes things far slower than sequential code. You should avoid this access pattern like the plague.
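To illustrate: one way to avoid false sharing between per-thread atomic counters is to pad each counter to its own cache line. This is just a sketch; the 64-byte line size is a typical x86 value and numThreads is a placeholder:

#include <atomic>
#include <cstddef>
#include <vector>

// Each counter is aligned/padded to an assumed 64-byte cache line, so
// increments of counters[i] and counters[j] (i != j) never compete for
// the same line, i.e. no false sharing.
struct alignas(64) PaddedCounter {
    std::atomic<std::size_t> value{0};
};

std::vector<PaddedCounter> counters(numThreads); // numThreads: placeholder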
Atomic operations are significantly slower than basic non-atomic operations, so you may need several threads just for the parallel computation to be as fast as the sequential one, assuming there is no contention (otherwise it is almost never worth it). If the graph is large and the average number of degrees of freedom is small, then atomics should scale, assuming the target platform has enough cores to mitigate the cost of the atomic accesses. There is bad news, though: atomic accesses tend to be more expensive on platforms with more cores. Indeed, on many-core platforms the LLC tends to be significantly bigger, which generally means a bigger latency to fetch cache lines from it (not to mention that cache coherence between the cores' caches also tends to be more expensive).
Question 2: Theoretically, I expect: [...]
As said before, the performance is highly dependent on the number of threads used, the size of the graph, its topology, and the target platform. That being said, I expect both parallel versions to be slower than the sequential one with few threads. The atomic version can scale better than the one using locks, but it depends on how you use the locks. Indeed, locks are typically implemented using an atomic variable internally (the user-land lock state). If the lock is not taken, the overhead of a lock is similar to that of an atomic variable. If the lock is taken, the thread waits for it to be released by another thread, causing an expensive context switch. That being said, context switches can help to reduce contention, so locks can sometimes be faster than atomics under contention.
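To make the "locks are atomics internally" point concrete, here is a toy user-land spinlock whose whole state is a single atomic flag (a sketch; real lock implementations additionally fall back to the OS under contention, which is where the context switch comes from):

#include <atomic>

// Toy spinlock: the lock state is just one atomic flag, which is why an
// uncontended lock/unlock pair costs roughly as much as atomic operations.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { flag.clear(std::memory_order_release); }
};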
For such a use-case, I expect splitting the graph into chunks, each computed by only one thread, to be faster than using locks or atomics. See the answer by @VictorEijkhout.
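A minimal sketch of that idea, reusing the question's graph interface (graph, getEdge, v1.index are assumed from the question): each thread counts into a private array, and the per-thread arrays are merged once at the end, so the hot loop needs no atomics at all.

std::vector<size_t> dofs(graph.getNumVertices(), 0);

#pragma omp parallel
{
    std::vector<size_t> local(dofs.size(), 0); // private to this thread

    #pragma omp for nowait
    for (int edgeInd = 0; edgeInd < static_cast<int>(graph.getNumEdges()); ++edgeInd)
    {
        const auto& edge = graph.getEdge(edgeInd);
        ++local[edge.v1.index]; // plain increments: no sharing here
        ++local[edge.v2.index];
    }

    // One synchronized merge per thread instead of two atomics per edge.
    #pragma omp critical
    for (size_t i = 0; i < dofs.size(); ++i)
        dofs[i] += local[i];
}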
If the graph is huge, then the cost of the operation will be completely latency-bound due to the high latency of DRAM accesses (e.g. 60-120 ns). The impact of atomics will certainly be small in that case as long as there is no contention (otherwise it will be a huge disaster). Improving data locality and the memory access pattern using chunks can significantly improve performance; strategies like sorting nodes, clustering, and more compact data structures can often help. Besides, such a computation also tends not to scale because each access fetches a full cache line while only a few bytes of it are actually used, so most of the DRAM bandwidth is wasted. On top of that, on most mainstream processors the achieved throughput cannot even get close to the theoretical DRAM bandwidth because of the way DRAM operates. Indeed, DRAM is optimized for sequential accesses thanks to prefetching and burst reads/writes. Moreover, it is split into multiple banks; using multiple banks is critical to mitigate the latency of bank accesses, but that is only possible if the data is spread quite evenly between the banks (even at a fine grain).
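As a sketch of the sorting strategy (hypothetical helper code, again reusing the question's interface): build a permutation of the edges ordered by their first vertex index, so that consecutive iterations update neighbouring entries of dofs:

#include <algorithm>
#include <numeric>
#include <vector>

// Permutation of edge indices sorted by the first vertex index, so the
// counting loop touches dofs[] in roughly increasing order and reuses
// cache lines instead of re-fetching them from DRAM.
std::vector<int> order(graph.getNumEdges());
std::iota(order.begin(), order.end(), 0);
std::sort(order.begin(), order.end(), [&](int a, int b) {
    return graph.getEdge(a).v1.index < graph.getEdge(b).v1.index;
});
// The counting loop then iterates over order[] instead of 0..numEdges-1.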
Upvotes: 3
Reputation: 5810
As the comment by @PierU observed, your atomic directive makes the code correct, but probably leaves it effectively serial. You need a different approach.
See my previous answers to very similar questions:
https://stackoverflow.com/a/77273690/2044454
https://stackoverflow.com/a/76854330/2044454
This is considerably easier to code in C++, but you can emulate it in C.
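For illustration, here is a sketch of this kind of approach in C++ using an OpenMP (4.0+) user-defined reduction over the std::vector itself; the graph interface comes from the question, and this is a sketch along those lines rather than the exact code from the linked answers:

#include <algorithm>
#include <functional>
#include <vector>

// Tell OpenMP how two per-thread vectors are combined (element-wise sum)
// and how each thread's private copy is initialized (zeros, same size).
#pragma omp declare reduction(vec_add : std::vector<size_t> :              \
        std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(),     \
                       omp_out.begin(), std::plus<size_t>()))              \
    initializer(omp_priv = std::vector<size_t>(omp_orig.size(), 0))

std::vector<size_t> dofs(graph.getNumVertices(), 0);

#pragma omp parallel for reduction(vec_add : dofs)
for (int edgeInd = 0; edgeInd < static_cast<int>(graph.getNumEdges()); ++edgeInd)
{
    const auto& edge = graph.getEdge(edgeInd);
    ++dofs[edge.v1.index]; // updates the thread-private copy, no atomics
    ++dofs[edge.v2.index];
}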
Upvotes: 1