netpcvnn

Reputation: 11

Dynamically allocated array copy in CUDA

Is there any method to copy an array inside a CUDA kernel?

For example:

__device__ int number_element;

__device__ void copyData(float* input, float* output) {}

I want to copy the elements of the input array that satisfy some condition to the output array, and also store the number of copied elements in number_element.

Thank you.

Upvotes: 1

Views: 307

Answers (2)

talonmies

Reputation: 72349

What you are really describing is called stream compaction. The thrust library has a range of built-in stream compaction functions that can be called from within kernels. As a trivial example:

#include <iostream>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct op
{
  __host__ __device__
  bool operator()(const int x) { return (x % 3) == 0; }
};

__global__ void kernel(int* input, int* output, int Nin, int* Nout)
{
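    // copy_if keeps the elements satisfying op and returns an iterator past the last one written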
    auto output_end = thrust::copy_if(thrust::device, input, input + Nin, output, op());
    *Nout = output_end - output;
}

int main()
{
    const int N = 10;
    const size_t sz = sizeof(int) * size_t(N);

    int* in;
    cudaMallocManaged((void **)&in, sz);
    int* out;
    cudaMallocManaged((void **)&out, sz);
    int* Nout;
    cudaMallocManaged((void **)&Nout, sizeof(int));

    for(int i=0; i<N; i++) {
        in[i] = 1+i;
        out[i] = -1;
    }

    kernel<<<1,1>>>(in, out, N, Nout);
    cudaDeviceSynchronize();

    for(int i=0; i < *Nout; i++) {
        std::cout << i << " " << out[i] << std::endl;
    }

    return 0;
}

which compiles and runs like so:

$ nvcc -std=c++11 -arch=sm_52 thrust_device_compact.cu 
$ ./a.out 
0 3
1 6
2 9

This might be a quick and easy way of performing stream compaction within a kernel on a small amount of data. If you have a lot of data, then using thrust from the host and having thrust run kernels on your behalf probably makes more sense.
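For instance, here is a minimal host-side sketch of the same compaction, reusing the op functor from above; thrust dispatches the device kernels itself:

#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

struct op
{
  __host__ __device__
  bool operator()(const int x) { return (x % 3) == 0; }
};

int main()
{
    const int N = 10;
    thrust::device_vector<int> in(N), out(N);
    thrust::sequence(in.begin(), in.end(), 1);   // in = 1, 2, ..., N

    // the compaction runs on the device; the returned iterator marks the end of the output
    auto out_end = thrust::copy_if(in.begin(), in.end(), out.begin(), op());
    int Nout = out_end - out.begin();

    for (int i = 0; i < Nout; i++)
        std::cout << i << " " << out[i] << std::endl;

    return 0;
}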

Upvotes: 2

CygnusX1

Reputation: 21778

Yes, you can have one by writing it.

You can, for example, do it the way it is done in this answer: Best way to copy global into shared memory; just skip the shared-memory part.

// assumes sizeof(T) is a multiple of sizeof(int) and T is aligned to at least alignof(int)
// assumes a one-dimensional kernel launch
// assumes it is called by all threads in the block
template <typename T>
__device__ void memCopy(T* dest, T* src, size_t size) {
    int* iDest = (int*)dest;
    int* iSrc = (int*)src;
    for(size_t i = threadIdx.x; i<size*sizeof(T)/sizeof(int); i+=blockDim.x)
        iDest[i] = iSrc[i];
    __syncthreads();
}

This assumes a single-block operation, intended for data used by that specific block. If you want a whole-grid copy, you certainly can do one, but it needs to be launched as a separate kernel to ensure that all writes are visible to all other blocks. In that case, cudaMemcpy may be better than a kernel call.
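As a purely hypothetical usage sketch (the processTile kernel and its buffers are illustrative, not from the linked answer), each block could copy its own tile with memCopy before working on it:

// Hypothetical caller: every thread in the block participates in the copy
__global__ void processTile(float* tiles, float* scratch, size_t tileSize) {
    memCopy(scratch + blockIdx.x * tileSize,
            tiles + blockIdx.x * tileSize,
            tileSize);
    // ... per-block work on the copied tile goes here ...
}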

In any case, for a grid operation you need to change the loop:

for(size_t i = threadIdx.x+blockIdx.x*blockDim.x;
    i<size*sizeof(T)/sizeof(int);
    i+=blockDim.x*gridDim.x)
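Wrapped up as a standalone kernel, a minimal sketch might look like this (the memCopyKernel name and launch configuration are assumptions, not part of the original answer):

// same aliasing and alignment assumptions as memCopy above
template <typename T>
__global__ void memCopyKernel(T* dest, T* src, size_t size) {
    int* iDest = (int*)dest;
    int* iSrc = (int*)src;
    for(size_t i = threadIdx.x + blockIdx.x*blockDim.x;
        i < size*sizeof(T)/sizeof(int);
        i += blockDim.x*gridDim.x)
        iDest[i] = iSrc[i];
}

// Example launch for N floats with 256 threads per block:
// memCopyKernel<<<(N + 255)/256, 256>>>(d_dest, d_src, N);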

Upvotes: 0
