TVC

Reputation: 482

C# managedCuda 2D array to GPU

I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel. I have the following working code for a 1-dimensional array:

class Program
{
    static void Main(string[] args)
    {
        int N = 10;
        int deviceID = 0;
        CudaContext ctx = new CudaContext(deviceID);
        CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
        kernel.GridDimensions = (N + 255) / 256;
        kernel.BlockDimensions = Math.Min(N,256);

        // Allocate input vectors h_A in host memory
        float[] h_A = new float[N];

        // Initialize input vectors h_A
        for (int i = 0; i < N; i++)
        {
            h_A[i] = i;
        }

        // Allocate vectors in device memory and copy vectors from host memory to device memory 
        CudaDeviceVariable<float> d_A = h_A;
        CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);

        // Invoke kernel
        kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);

        // Copy result from device memory to host memory
        float[] h_C = d_C;
        // h_C contains the result in host memory
    }
}

with the following kernel code:

__global__ void DoubleIt(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * 2;
}

As I said, everything works fine, but I want to work with a 2D array as follows:

// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];

// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
    h_A[i] = new float[W];
    for (int j = 0; j < W; j++)
    {
        h_A[i][j] = i*W+j;
    }
}

I need the whole second dimension to be handled by the same thread, so kernel.BlockDimensions must stay one-dimensional and each kernel thread needs to receive a 1D array of 10 elements.

So my bottom-line question is: how should I copy this 2D array to the device, and how do I use it in the kernel? (For this example, there should be a total of 10 threads.)

Upvotes: 0

Views: 2047

Answers (1)

kunzmi

Reputation: 1024

Short answer: you shouldn't do it...

Long answer: Jagged arrays are difficult to handle in general. Instead of one contiguous segment of memory for your data, you have many small ones scattered sparsely across your memory. What happens if you copy the data to the GPU? If you had one large contiguous segment, you would call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you'd have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel you maintain a host array CudaDeviceVariable<float>[] that manages your CUdeviceptrs on the host side. Copying data is already quite slow in general; doing it this way is probably a real performance killer...

To conclude: if you can, use flattened arrays and index the entries as y * DimX + x. Even better on the GPU side, use pitched memory, where the allocation is done so that each line starts at a "good" address: the index then becomes y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line gets some additional padding bytes.
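
For example, here is a minimal sketch of the flattened-array variant, adapted to the N x W layout from the question (one thread per row). The kernel name DoubleRows and the file name doubleRows.ptx are made up for illustration, and ctx is the CudaContext from the question:

int N = 10;  // number of rows, one thread per row
int W = 10;  // elements per row

CudaKernel kernel = ctx.LoadKernel(@"doubleRows.ptx", "DoubleRows");
kernel.GridDimensions = (N + 255) / 256;
kernel.BlockDimensions = Math.Min(N, 256);

// Flatten the 2D data into one contiguous host array, index i * W + j
float[] h_A = new float[N * W];
for (int i = 0; i < N; i++)
    for (int j = 0; j < W; j++)
        h_A[i * W + j] = i * W + j;

CudaDeviceVariable<float> d_A = h_A;                                  // allocate and copy to device
CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N * W);

kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N, W);

float[] h_C = d_C;  // copy result back; h_C[i * W + j] is row i, column j

//Corresponding kernel (one thread processes one row of W elements):
extern "C"
__global__ void DoubleRows(const float* A, float* C, int N, int W)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        for (int j = 0; j < W; j++)
            C[i * W + j] = A[i * W + j] * 2;
    }
}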

For completeness: C# also has true 2-dimensional arrays like float[,]. You could use these on the host side instead of flattened 1D arrays, but I wouldn't recommend it: the ISO standard for .NET does not guarantee that their internal memory is actually contiguous, an assumption managedCuda has to make in order to use them. The current .NET Framework doesn't have any internal weirdness here, but who knows whether it will stay like this...

This is how the jagged array copy could be realized:

float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h; //represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d; //Array of CudaDeviceVariables to manage memory, source for pointers in ptrsToData_h.

int sizeX = 512;
int sizeY = 256;

data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
    data_h[x] = new float[sizeY];
    arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
    ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
    //ToDo: init data on host...
}
//Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);

//Copy data:
for (int x = 0; x < sizeX; x++)
{
    arrayOfarray_d[x].CopyToDevice(data_h[x]);
}

//Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);

//kernel in the *.cu file:
//__global__ void kernel(float** data_d, ...)
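
The kernel body isn't shown above; a minimal sketch matching the question's requirement (one thread per row, here doubling each element) could look like this (the kernel name is made up, and sizeX/sizeY would be passed to kernel.Run as the additional parameters):

//Hypothetical kernel for the jagged layout: each thread handles one row.
extern "C"
__global__ void DoubleRows(float** data, int sizeX, int sizeY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= sizeX)
        return;

    float* row = data[x];       // data[x] is the device pointer copied from ptrsToData_h[x]
    for (int y = 0; y < sizeY; y++)
        row[y] = row[y] * 2.0f; // double every element of this row
}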

This is a sample for CudaPitchedDeviceVariable:

int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
{
    for (int x = 0; x < dimX; x++)
    {
        array_host[y * dimX + x] = x * y;
    }
}

arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);

//Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY)
        return;

    //pointer arithmetic: add y*pitch to char* pointer as pitch is given in bytes,
    //which gives the start of line y. Convert to float* and add x, to get the
    //value at entry x of line y:
    float value = *(((float*)((char*)data + y * pitch)) + x);

    *(((float*)((char*)data + y * pitch)) + x) = value + 1;

    //Or simpler if you don't like pointers:
    float* line = (float*)((char*)data + y * pitch);
    float value2 = line[x];
}
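
The pitched sample above doesn't show the launch configuration. Since this kernel uses 2D thread indices, the grid and block dimensions have to be two-dimensional as well; a minimal sketch, assuming the dim3 type from ManagedCuda.VectorTypes and an arbitrary 16x16 block size:

//2D launch configuration for the pitched kernel above (16x16 threads per block is just an example choice):
kernel.BlockDimensions = new dim3(16, 16, 1);
kernel.GridDimensions = new dim3((dimX + 15) / 16, (dimY + 15) / 16, 1);

kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);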

Upvotes: 2
