Reputation: 103
I'd like to revisit the question of implementing a simple spinlock in CUDA, now that Independent Thread Scheduling (ITS) has been around for a while.
My code looks like this:
// nvcc main.cu -arch=sm_75
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <vector>

#include "cuda.h"

constexpr int kN = 21;

using Ptr = uint8_t*;

struct DynamicNode {
  int32_t lock = 0;
  int32_t n = 0;
  Ptr ptr = nullptr;
};

// Each of the kN threads walks over all kN nodes, so every n should
// end up equal to kN.
__global__ void func0(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    // entering the critical section
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();
    // Use a condition to artificially boost the complexity
    // of loop unrolling for the compiler
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    // leaving the critical section
    atomicExch(lock, 0);
    __threadfence();
  }
}

int main() {
  DynamicNode* dev_root = nullptr;
  constexpr int kRootSize = sizeof(DynamicNode) * kN;
  cudaMalloc((void**)&dev_root, kRootSize);
  cudaMemset(dev_root, 0, kRootSize);
  func0<<<1, kN>>>(dev_root);
  cudaDeviceSynchronize();
  std::vector<int32_t> host_root(kRootSize / sizeof(int32_t), 0);
  cudaMemcpy(host_root.data(), dev_root, kRootSize, cudaMemcpyDeviceToHost);
  cudaFree((void*)dev_root);
  const auto* base = reinterpret_cast<const DynamicNode*>(host_root.data());
  int sum = 0;
  for (int i = 0; i < kN; ++i) {
    auto& dn = base[i];
    std::cout << "i=" << i << " len=" << dn.n << std::endl;
    sum += dn.n;
  }
  std::cout << "sum=" << sum << " expected=" << kN * kN << std::endl;
  return 0;
}
As you can see, there's a naive spinlock implemented in func0. While I understand that this would deadlock on older architectures (e.g. https://forums.developer.nvidia.com/t/atomic-locks/25522/2), if I compile the code with nvcc main.cu -arch=sm_75, it actually runs without blocking indefinitely.
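For completeness, the workaround usually suggested for pre-Volta GPUs (e.g. in the thread linked above) is to keep the acquire, the critical section, and the release inside the same branch, so the lane that wins the lock never waits on its spinning warp mates. Here's a rough sketch of that shape, reusing DynamicNode, kN, and Ptr from the listing above (I haven't verified this variant; the kernel name is just for illustration):

// Sketch of the pre-Volta-safe shape: acquire, critical section, and
// release all live in the same branch, so the winning lane releases the
// lock before the losing lanes get another chance to spin.
__global__ void func0_prevolta_shape(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    auto* lock = &(dn->lock);
    bool done = false;
    while (!done) {
      if (atomicExch(lock, 1) == 0) {  // try-lock succeeded
        __threadfence();
        if (dn->ptr == nullptr) {
          dn->ptr = reinterpret_cast<Ptr>(0xf0);
        }
        __threadfence();      // make critical-section writes visible first
        atomicExch(lock, 0);  // release inside the same branch as the acquire
        done = true;
      }
    }
  }
}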
However, what I do notice is that n in each DynamicNode ends up as garbage. Here's the output on a GeForce RTX 2060 (laptop), which I can reproduce deterministically:
i=0 len=21
i=1 len=230
i=2 len=19
i=3 len=18
i=4 len=17
i=5 len=16
i=6 len=15
i=7 len=14
i=8 len=13
i=9 len=12
i=10 len=11
i=11 len=10
i=12 len=9
i=13 len=8
i=14 len=7
i=15 len=6
i=16 len=5
i=17 len=4
i=18 len=3
i=19 len=2
i=20 len=1
sum=441 expected=441
Ideally, n in every DynamicNode should be kN. I've also tried larger values of kN (*), and in every case only sum comes out correct.
Have I misunderstood something about ITS? Does ITS actually guarantee that a lock implementation like this works? If not, what am I missing here?
(*) With a smaller kN, nvcc might actually unroll the loop, from what I saw in the PTX. I've never observed any problem when the loop is unrolled.
Update 02/02/2021
I should have clarified that I tested this on CUDA 11.1. According to @robert-crovella, upgrading to 11.2 would fix the problem.
Update 02/03/2021
I tested with the CUDA 11.2 driver; it still didn't fully solve the problem with a larger kN:

kN \ CUDA | 11.1 | 11.2
---|---|---
21 | N | OK
128 | N | N
Upvotes: 1
Views: 392
Reputation: 151879
This appears to have been some sort of code generation defect in the compiler. The solution seems to be to update to CUDA 11.2 (or, presumably, a newer version in the future).
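For reference, the lock formulation usually shown in CUDA examples acquires with atomicCAS and issues a __threadfence() before the release rather than after it, so the critical-section writes are visible before the lock reads as free. A rough sketch of that shape, assuming sm_70+ and reusing the int32_t lock field from the question (the helper names are illustrative):

__device__ void lock_node(int32_t* lock) {
  // Spin until the CAS observes the lock free (0) and claims it (1).
  // With Independent Thread Scheduling the holder can keep making
  // progress even while other lanes of the same warp are spinning.
  while (atomicCAS(lock, 0, 1) != 0) {
  }
  __threadfence();  // order the critical section after the acquire
}

__device__ void unlock_node(int32_t* lock) {
  __threadfence();  // publish critical-section writes before releasing
  atomicExch(lock, 0);
}

I'm not suggesting the fence placement is what mattered here; the failure observed above is attributed to the compiler defect, so treat this only as the conventional arrangement, not as a fix.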
Upvotes: 1