Ye Kuang

Reputation: 103

CUDA spinlock implementation with Independent Thread Scheduling supported?

I'd like to revisit the situation of implementing a simple spinlock on CUDA, now that Independent Thread Scheduling (ITS) has been introduced for a while.

My code looks like this:

// nvcc main.cu -arch=sm_75
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <vector>

#include <cuda_runtime.h>

constexpr int kN = 21;

using Ptr = uint8_t*;

struct DynamicNode {
  int32_t lock = 0;
  int32_t n = 0;
  Ptr ptr = nullptr;
};

__global__ void func0(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    // entering the critical section
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();
    // Use a condition to artificially boost the complexity
    // of loop unrolling for the compiler
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    // leaving the critical section
    atomicExch(lock, 0);
    __threadfence();
  }
}

int main() {
  DynamicNode* dev_root = nullptr;
  constexpr int kRootSize = sizeof(DynamicNode) * kN;
  cudaMalloc((void**)&dev_root, kRootSize);
  cudaMemset(dev_root, 0, kRootSize);
  func0<<<1, kN>>>(dev_root);
  cudaDeviceSynchronize();

  std::vector<int32_t> host_root(kRootSize / sizeof(int32_t), 0);
  cudaMemcpy(host_root.data(), dev_root, kRootSize, cudaMemcpyDeviceToHost);
  cudaFree((void*)dev_root);
  const auto* base = reinterpret_cast<const DynamicNode*>(host_root.data());
  int sum = 0;
  for (int i = 0; i < kN; ++i) {
    auto& dn = base[i];
    std::cout << "i=" << i << " len=" << dn.n << std::endl;
    sum += dn.n;
  }
  std::cout << "sum=" << sum << " expected=" << kN * kN << std::endl;
  return 0;
}

As you can see, there's a naive spinlock implemented in func0. While I understand that this would result in deadlock on older archs (e.g. https://forums.developer.nvidia.com/t/atomic-locks/25522/2), if I compile the code with nvcc main.cu -arch=sm_75, it actually runs without blocking indefinitely.
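
For context, the pre-ITS workaround I have in mind keeps the acquire, the critical section, and the release inside the single branch that wins the lock, so the holder can't be starved by its spinning warp mates. A minimal sketch reusing the types above (func0_legacy is just an illustrative name):

__global__ void func0_legacy(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    bool done = false;
    while (!done) {
      // Only the thread that wins the CAS enters; it finishes the whole
      // critical section and releases before leaving the branch.
      if (atomicCAS(&(dn->lock), 0, 1) == 0) {
        __threadfence();
        if (dn->ptr == nullptr) {
          dn->ptr = reinterpret_cast<Ptr>(0xf0);
        }
        __threadfence();
        atomicExch(&(dn->lock), 0);
        done = true;
      }
    }
  }
}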

However, what I do notice is that the value of n in each DynamicNode is complete garbage. Here's the output on a GeForce RTX 2060 (laptop), which I can reproduce deterministically:

i=0 len=21
i=1 len=230
i=2 len=19
i=3 len=18
i=4 len=17
i=5 len=16
i=6 len=15
i=7 len=14
i=8 len=13
i=9 len=12
i=10 len=11
i=11 len=10
i=12 len=9
i=13 len=8
i=14 len=7
i=15 len=6
i=16 len=5
i=17 len=4
i=18 len=3
i=19 len=2
i=20 len=1
sum=441 expected=441

Ideally, n should be kN for every DynamicNode. I've also tried larger values of kN (*), and it's always the case that only the sum is correct.

Have I misunderstood something about ITS? Can ITS actually guarantee the correctness of such a lock implementation? If not, what am I missing here?

(*) With a smaller kN, nvcc might actually unroll the loop, from what I saw in the PTX. I've never observed any problem when the loop is unrolled.
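
To rule unrolling in or out without changing kN, the loop can be kept rolled with #pragma unroll 1, and the generated code inspected via nvcc -ptx main.cu -arch=sm_75. A sketch of that variant (identical to func0 apart from the pragma; illustrative only):

__global__ void func0_rolled(DynamicNode* base) {
  // Ask nvcc not to unroll, so the non-unrolled code path is exercised
  // even for small kN.
#pragma unroll 1
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    atomicExch(lock, 0);
    __threadfence();
  }
}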


Update 02/02/2021

I should have clarified that I tested this on CUDA 11.1. According to @robert-crovella, upgrading to 11.2 would fix the problem.
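
For anyone reproducing this, a quick way to confirm which runtime/driver pair a binary actually sees is something like the sketch below (the reported values encode 1000*major + 10*minor, so 11010 means 11.1 and 11020 means 11.2):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int runtime_version = 0;
  int driver_version = 0;
  // Both queries are part of the CUDA runtime API.
  cudaRuntimeGetVersion(&runtime_version);
  cudaDriverGetVersion(&driver_version);
  std::printf("runtime=%d driver=%d\n", runtime_version, driver_version);
  return 0;
}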


Update 02/03/2021

I tested with the CUDA 11.2 driver; it still didn't fully solve the problem for larger kN (N = wrong per-node counts, OK = all counts correct):

kN \ CUDA    11.1    11.2
21           N       OK
128          N       N

Upvotes: 1

Views: 392

Answers (1)

Robert Crovella

Reputation: 151879

This appears to have been some sort of code generation defect in the compiler. The solution seems to be to update to CUDA 11.2 (or, presumably, a newer release in the future).
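
As an aside, and independent of the compiler issue: the more conventional fence placement is after acquiring the lock and before releasing it, so that the writes made in the critical section are visible before the lock appears free. A sketch of the kernel with that ordering (illustrative only, not a claim that it changes the outcome here):

__global__ void func0_fenced(DynamicNode* base) {
  for (int i = 0; i < kN; ++i) {
    DynamicNode* dn = base + i;
    atomicAdd(&(dn->n), 1);
    auto* lock = &(dn->lock);
    while (atomicExch(lock, 1) == 1) {
    }
    __threadfence();  // acquire side: order the lock acquisition before the accesses below
    if (dn->ptr == nullptr) {
      dn->ptr = reinterpret_cast<Ptr>(0xf0);
    }
    __threadfence();  // release side: publish the writes above before the unlock
    atomicExch(lock, 0);
  }
}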

Upvotes: 1
