Reputation: 103
I'd like to revisit the question of implementing a simple spinlock in CUDA, now that Independent Thread Scheduling (ITS) has been around for a while.
My code looks like this:
// nvcc main.cu -arch=sm_75
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <vector>

#include "cuda.h"

constexpr int kN = 21;

using Ptr = uint8_t*;

struct DynamicNode {
  int32_t lock = 0;
  int32_t n = 0;
  Ptr ptr = nullptr;
};

// Each of the kN threads walks over all kN nodes, so every n should
// end up equal to kN.
__global__ void func0(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    // entering the critical section
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();
    // Use a condition to artificially boost the complexity
    // of loop unrolling for the compiler
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    // leaving the critical section
    atomicExch(lock, 0);
    __threadfence();
  }
}

int main() {
  DynamicNode* dev_root = nullptr;
  constexpr int kRootSize = sizeof(DynamicNode) * kN;
  cudaMalloc((void**)&dev_root, kRootSize);
  cudaMemset(dev_root, 0, kRootSize);
  func0<<<1, kN>>>(dev_root);
  cudaDeviceSynchronize();
  std::vector<int32_t> host_root(kRootSize / sizeof(int32_t), 0);
  cudaMemcpy(host_root.data(), dev_root, kRootSize, cudaMemcpyDeviceToHost);
  cudaFree((void*)dev_root);
  const auto* base = reinterpret_cast<const DynamicNode*>(host_root.data());
  int sum = 0;
  for (int i = 0; i < kN; ++i) {
    auto& dn = base[i];
    std::cout << "i=" << i << " len=" << dn.n << std::endl;
    sum += dn.n;
  }
  std::cout << "sum=" << sum << " expected=" << kN * kN << std::endl;
  return 0;
}
As you can see, there's a naive spinlock implemented in func0. While I understand that this would deadlock on older architectures (e.g. https://forums.developer.nvidia.com/t/atomic-locks/25522/2), if I compile the code with nvcc main.cu -arch=sm_75, it actually runs without blocking indefinitely.
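For completeness, the workaround usually suggested for pre-Volta GPUs (e.g. in the thread linked above) is to keep the acquire, the critical section, and the release inside the same branch, so the lane that wins the lock never waits on its spinning warp mates. Here's a rough sketch of that shape, reusing DynamicNode, kN, and Ptr from the listing above (I haven't verified this variant; the kernel name is just for illustration):

// Sketch of the pre-Volta-safe shape: acquire, critical section, and
// release all live in the same branch, so the winning lane releases the
// lock before the losing lanes get another chance to spin.
__global__ void func0_prevolta_shape(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    auto* lock = &(dn->lock);
    bool done = false;
    while (!done) {
      if (atomicExch(lock, 1) == 0) {  // try-lock succeeded
        __threadfence();
        if (dn->ptr == nullptr) {
          dn->ptr = reinterpret_cast<Ptr>(0xf0);
        }
        __threadfence();      // make critical-section writes visible first
        atomicExch(lock, 0);  // release inside the same branch as the acquire
        done = true;
      }
    }
  }
}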
However, what I do notice is that n in each DynamicNode ends up as garbage. Here's the output on a GeForce RTX 2060 (laptop), which I can reproduce deterministically:
i=0 len=21
i=1 len=230
i=2 len=19
i=3 len=18
i=4 len=17
i=5 len=16
i=6 len=15
i=7 len=14
i=8 len=13
i=9 len=12
i=10 len=11
i=11 len=10
i=12 len=9
i=13 len=8
i=14 len=7
i=15 len=6
i=16 len=5
i=17 len=4
i=18 len=3
i=19 len=2
i=20 len=1
sum=441 expected=441
Ideally, n in every DynamicNode should be kN. I've also tried larger values of kN (*), and in every case only sum comes out correct.
Have I misunderstood something about ITS? Does ITS actually guarantee that a lock implementation like this works? If not, what am I missing here?
(*) With a smaller kN, nvcc might actually unroll the loop, from what I saw in the PTX. I've never observed any problem when the loop is unrolled.
Update 02/02/2021
I should have clarified that I tested this on CUDA 11.1. According to @robert-crovella, upgrading to 11.2 would fix the problem.
Update 02/03/2021
I tested with the CUDA 11.2 driver; it still didn't fully solve the problem with a larger kN:

kN \ CUDA | 11.1 | 11.2
---|---|---
21 | N | OK
128 | N | N
Upvotes: 1
Views: 392
Reputation: 151879
This appears to have been some sort of code generation defect in the compiler. The solution seems to be to update to CUDA 11.2 (or, presumably, a newer version in the future).
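For reference, the lock formulation usually shown in CUDA examples acquires with atomicCAS and issues a __threadfence() before the release rather than after it, so the critical-section writes are visible before the lock reads as free. A rough sketch of that shape, assuming sm_70+ and reusing the int32_t lock field from the question (the helper names are illustrative):

__device__ void lock_node(int32_t* lock) {
  // Spin until the CAS observes the lock free (0) and claims it (1).
  // With Independent Thread Scheduling the holder can keep making
  // progress even while other lanes of the same warp are spinning.
  while (atomicCAS(lock, 0, 1) != 0) {
  }
  __threadfence();  // order the critical section after the acquire
}

__device__ void unlock_node(int32_t* lock) {
  __threadfence();  // publish critical-section writes before releasing
  atomicExch(lock, 0);
}

I'm not suggesting the fence placement is what mattered here; the failure observed above is attributed to the compiler defect, so treat this only as the conventional arrangement, not as a fix.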
Upvotes: 1