Reputation: 12175
I need to create some sort of stack to crawl a tree within a CUDA kernel. I thought I could use thrust::device_vector, but apparently not. Is there an API for this, or do I have to just code it myself?
__global__
void step_objects_kernel(ContainerNode* root, ObjectNode** objs, ObjectNode* new_objs, size_t n, real dt, real g)
{
  int idx = blockIdx.x * gridDim.x + threadIdx.x;
  if(idx >= n) return;

  thrust::device_vector<Node*> to_visit;
  to_visit.push_back(root);

  vec3 a = {0};
  while(!to_visit.empty())
  {
    Node* n = to_visit.back();
    to_visit.pop_back();
  }
}
error: calling a __host__ function("thrust::device_vector<Node *, thrust::device_malloc_allocator<Node *> > ::device_vector") from a __global__ function("step_objects_kernel") is not allowed
Upvotes: 1
Views: 3556
Reputation: 151799
It is correct that thrust::device_vector is not usable in CUDA device code.
I'm not aware of any in-kernel container-like API that is part of the CUDA distribution itself. If you search around, however, you will probably find dozens of possibly useful or interesting implementations. A lower-level library such as trove could possibly give improved performance for this kind of use case.
In your example, it appears that each thread will maintain its own "stack" or "vector" to keep track of tree traversal. (The method I will offer here depends on not having threads concurrently accessing the same stack. If you need concurrent access from several threads, the method here may be of interest as a starting point.)
If you know the maximum probable size of such a stack, I would suggest allocating for it ahead of time, either as a static (local) per-thread variable defined in-kernel, or as a dynamic allocation, e.g. via cudaMalloc. (I would not suggest in-kernel malloc for this, and I definitely would not suggest allocating/deallocating on the fly, for performance reasons.) Which allocation method gives the best performance may depend on your actual test case: the coalescing rules (i.e. the underlying storage layout) are somewhat different for access through a global pointer vs. through a local pointer. If your threads tend to push and pop uniformly across a warp as your code progresses, then either allocation method may give good performance. You can experiment with both approaches.
Here's a fairly simple partially worked example of the "stack" methods you have outlined in your example, assuming the maximum stack size per thread is known a priori. It's by no means fully tested; my purpose is to give you some ideas or a starting point. However if you find errors, please feel free to point them out and I will try to address them.
$ cat t1082.cu
const size_t max_items = 256;          // maximum stack depth per thread

template <typename T>
class cu_st{                           // simple implementation of "stack" function
  T *my_ptr;                           // next free slot for this thread's stack
  size_t n_items;                      // current number of items on the stack
  size_t my_width;                     // spacing between successive items
  public:
  __host__ __device__
  cu_st(T *base, size_t id, size_t width=0){
    if (width == 0){                   // "local" stack allocated: contiguous storage
      my_ptr = base;
      my_width = 1;}
    else{                              // "global" stack allocated: interleaved storage
      my_ptr = base + id;
      my_width = width;}
    n_items = 0;}

  __host__ __device__
  int push_back(T &item){
    if (n_items < max_items){
      *my_ptr = item;
      my_ptr += my_width;
      n_items++;
      return 0;}
    return -1;}                        // stack full

  __host__ __device__
  T pop_back(){
    if (n_items > 0){
      n_items--;
      my_ptr -= my_width;}
    return *my_ptr;}

  __host__ __device__
  T back(){
    if (n_items > 0){
      return *(my_ptr-my_width);}
    return *my_ptr;}

  __host__ __device__
  bool empty(){
    return (n_items == 0);}

  __host__ __device__
  size_t size(){
    return n_items;}

  __host__ __device__
  size_t max_size(){
    return max_items;}
};

const size_t nTPB = 256;               // threads per block
const size_t nBLK = 256;               // number of blocks
typedef int Node;
__global__
void kernel(Node **g_stack, size_t n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if(idx >= n) return;
  Node *root = NULL;

  // method 1 - global stack: storage is interleaved across threads,
  // so adjacent threads in a warp access adjacent memory locations
  cu_st<Node*> to_visit(g_stack, idx, gridDim.x*blockDim.x);
  to_visit.push_back(root);
  while(!to_visit.empty())
  {
    Node* n = to_visit.back();
    to_visit.pop_back();
  }

  // method 2 - local stack: each thread uses its own private (local) array
  Node *l_stack[max_items];
  cu_st<Node*> l_to_visit(l_stack, idx);
  l_to_visit.push_back(root);
  while(!l_to_visit.empty())
  {
    Node* n = l_to_visit.back();
    l_to_visit.pop_back();
  }
}
int main(){

  Node **d_stack;
  // one interleaved stack region covering every thread in the grid
  cudaMalloc(&d_stack, nTPB*nBLK*max_items*sizeof(Node *));
  kernel<<<nBLK, nTPB>>>(d_stack, nTPB*nBLK);
  cudaDeviceSynchronize();
}
$ nvcc -o t1082 t1082.cu
$ cuda-memcheck ./t1082
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
Notes:
Other than what you see here, this code is not tested. I'd suggest doing more verification before using it as-is.
As you can see in the code, there is essentially no error checking.
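For example, error checking could be added to main along these lines (just a sketch of one possible reporting style, using standard CUDA runtime calls):

#include <cstdio>

int main(){

  Node **d_stack;
  cudaError_t err = cudaMalloc(&d_stack, nTPB*nBLK*max_items*sizeof(Node *));
  if (err != cudaSuccess) { printf("cudaMalloc: %s\n", cudaGetErrorString(err)); return 1; }
  kernel<<<nBLK, nTPB>>>(d_stack, nTPB*nBLK);
  err = cudaGetLastError();              // catches kernel launch/configuration errors
  if (err != cudaSuccess) { printf("launch: %s\n", cudaGetErrorString(err)); return 1; }
  err = cudaDeviceSynchronize();         // catches asynchronous kernel execution errors
  if (err != cudaSuccess) { printf("kernel: %s\n", cudaGetErrorString(err)); return 1; }
  cudaFree(d_stack);
  return 0;
}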
This sort of random access will generally tend to be slow, probably regardless of which allocation method you choose. If possible, minimize your use of such a "stack". If you know that the stack size per thread is very small, you could also try experimenting with a __shared__ memory allocation for this construct; a sketch follows.
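As a rough illustration of the __shared__ idea, here is a sketch that reuses cu_st with a block-wide shared buffer. The names kernel_shared and smax are illustrative, not part of the code above, and smax must be small enough for the buffer to fit in shared memory; note also that cu_st as written bounds pushes by max_items, so for real use max_items would need to be lowered to smax (or cu_st given a capacity parameter).

const size_t smax = 16;   // assumed maximum stack depth per thread (illustrative)

__global__
void kernel_shared(size_t n)
{
  // one interleaved stack region in shared memory for the whole block:
  // thread i uses s_stack[i], s_stack[i+blockDim.x], s_stack[i+2*blockDim.x], ...
  __shared__ Node *s_stack[nTPB * smax];   // 256*16*8 bytes = 32KB with the constants above
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  cu_st<Node*> to_visit(s_stack, threadIdx.x, blockDim.x);
  Node *root = NULL;
  to_visit.push_back(root);
  while (!to_visit.empty())
  {
    Node *p = to_visit.back();
    to_visit.pop_back();
  }
}
// launch e.g.: kernel_shared<<<nBLK, nTPB>>>(nTPB*nBLK);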
Another allocation approach, not demonstrated in the worked example, would be to give each thread a global allocation but have it push and pop contiguously rather than in the strided fashion shown above (algorithmically a combination of the two methods already outlined); see the sketch below. Such a method will definitely degrade performance in the "uniform" case, but may give better performance in some "random" access patterns.
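For reference, such a contiguous layout could reuse cu_st unchanged together with the same cudaMalloc allocation from the example; each thread simply points its stack at its own private chunk (kernel_contig is an illustrative name):

__global__
void kernel_contig(Node **g_stack, size_t n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  // each thread owns the contiguous range g_stack[idx*max_items .. (idx+1)*max_items-1];
  // the default width of 0 selects cu_st's contiguous ("local") addressing mode,
  // in which the id argument is ignored
  cu_st<Node*> to_visit(g_stack + (size_t)idx * max_items, idx);
  Node *root = NULL;
  to_visit.push_back(root);
  while (!to_visit.empty())
  {
    Node *p = to_visit.back();
    to_visit.pop_back();
  }
}
// launch e.g.: kernel_contig<<<nBLK, nTPB>>>(d_stack, nTPB*nBLK);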
Upvotes: 3