powermew

Reputation: 153

How can I copy a part of a 4D array from host memory to device memory?

I have a flattened 4-D array in host memory, and I want to copy a part of it (the red region) to the device, as in the image below.

[image: the 4-D array with the red sub-region to be copied highlighted]

I don't know how to copy a region that is not contiguous in memory.
The reason I copy only a part of the array is that the original array is over 10 GB and I only need about 10% of it.
At first I tried it with a for loop, but it took too much time.
Is there a better way?

#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    // sub-region to copy; all ranges are half-open [beg, end)
    int idx_x_beg = 50;   int idx_x_end = 100;
    int idx_y_beg = 100;  int idx_y_end = 200;
    int idx_z_beg = 150;  int idx_z_end = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;
    int idx_x_size = idx_x_end - idx_x_beg;
    int idx_y_size = idx_y_end - idx_y_beg;
    int idx_z_size = idx_z_end - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;
    // the device buffer holds only the (packed) sub-region
    cudaMalloc((void**)&d_4dArray, sizeof(double)*idx_x_size*idx_y_size*idx_z_size*idx_ch_size);

    // one cudaMemcpy per contiguous x-row of the sub-region: far too slow
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + temp_ch*idx_z_size*idx_y_size*idx_x_size + temp_z*idx_y_size*idx_x_size + temp_y*idx_x_size
                         , h_4dArray + (size_t)(temp_ch + idx_ch_beg)*nz*ny*nx + (temp_z + idx_z_beg)*ny*nx + (temp_y + idx_y_beg)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }

    return 0;
}

Upvotes: 1

Views: 469

Answers (1)

Robert Crovella

Reputation: 152164

For copying a subset of an array, CUDA provides cudaMemcpy2D (which can copy a single 2D section of a multidimensional array) and cudaMemcpy3D (which can copy a single 3D section of a multidimensional array). You can find lots of questions here on the cuda tag showing how to use those.
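
For reference, here is a minimal sketch of what a cudaMemcpy2D call could look like when pulling a 2D slice out of a larger host array and packing it into a device buffer (the names and sizes below are illustrative only, not taken from the question):

#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    int nx = 100, ny = 200;        // full array: ny rows of nx doubles
    int x_beg = 50, x_size = 50;   // sub-region in x
    int y_beg = 100, y_size = 100; // sub-region in y

    double *h_2d = (double *)malloc(sizeof(double)*nx*ny);
    double *d_sub;
    cudaMalloc(&d_sub, sizeof(double)*x_size*y_size);

    // dst pitch = packed sub-row width, src pitch = full row width (both in bytes),
    // followed by the width in bytes and height in rows of the region being copied
    cudaMemcpy2D(d_sub, x_size*sizeof(double),
                 h_2d + (size_t)y_beg*nx + x_beg, nx*sizeof(double),
                 x_size*sizeof(double), y_size,
                 cudaMemcpyHostToDevice);
    return 0;
}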

There are two problems with those approaches:

  1. They don't necessarily extend to the 4D case, i.e. you might still need a loop.
  2. The performance of these operations (host<->device transfer speed) is often significantly lower than that of a cudaMemcpy operation copying the same number of bytes in aggregate.

So there is no free lunch here. I believe the best suggestion is to create an extra "contiguous" buffer on the host, do all your slice-copying into that buffer, then copy that buffer from host to device in a single cudaMemcpy call. After that, if you still need the 4D representation on the device, you will need to write a device kernel that "scatters" the data for you, conceptually the reverse of the code you have shown.

Sorry, I'm not going to write all that code for you. However, I will rough out the first portion of it (getting everything copied to a single contiguous buffer on the device), using the code you have shown:

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    // sub-region to copy; all ranges are half-open [beg, end)
    int idx_x_beg = 50;   int idx_x_end = 100;
    int idx_y_beg = 100;  int idx_y_end = 200;
    int idx_z_beg = 150;  int idx_z_end = 300;
    int idx_ch_beg = 200; int idx_ch_end = 400;
    int idx_x_size = idx_x_end - idx_x_beg;
    int idx_y_size = idx_y_end - idx_y_beg;
    int idx_z_size = idx_z_end - idx_z_beg;
    int idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray, *h_temp, *d_temp;
    size_t temp_sz = (size_t)idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
    h_temp = (double *)malloc(temp_sz*sizeof(double));
    cudaMalloc(&d_temp, temp_sz*sizeof(double));
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    size_t size_tr = 0;
    // gather each contiguous x-row of the sub-region into the packed host buffer
    for (int temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (int temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (int temp_y = 0; temp_y < idx_y_size; temp_y++) {
                memcpy(h_temp + size_tr
                         , h_4dArray + (size_t)(temp_ch + idx_ch_beg)*nz*ny*nx + (temp_z + idx_z_beg)*ny*nx + (temp_y + idx_y_beg)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size);
                size_tr += idx_x_size;
            }
        }
    }
    // single host->device transfer of the packed buffer
    cudaMemcpy(d_temp, h_temp, temp_sz*sizeof(double), cudaMemcpyHostToDevice);
    // if necessary, put a cuda kernel here to scatter data from d_temp to d_4dArray
    return 0;
}

After that, as indicated, if you need the 4D representation on the device, you will need a CUDA kernel to scatter the data for you.
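
For completeness, here is a rough sketch of what such a scatter kernel could look like, assuming the same index variables and the packed layout produced by the gather loop above (the kernel name and launch configuration are my own illustration, not part of the original answer):

// scatter the packed sub-region in d_temp into the full 4D array d_4dArray
// (ranges are half-open [beg, end), matching the host code above)
__global__ void scatter4d(const double *d_temp, double *d_4dArray,
                          int nx, int ny, int nz,
                          int x_beg, int y_beg, int z_beg, int ch_beg,
                          int x_size, int y_size, int z_size, int ch_size)
{
    size_t i = (size_t)blockIdx.x*blockDim.x + threadIdx.x;
    size_t total = (size_t)x_size*y_size*z_size*ch_size;
    if (i >= total) return;
    // unpack the linear index of the packed buffer into (ch, z, y, x) within the sub-region
    int x  = (int)( i % x_size);
    int y  = (int)((i / x_size) % y_size);
    int z  = (int)((i / ((size_t)x_size*y_size)) % z_size);
    int ch = (int)( i / ((size_t)x_size*y_size*z_size));
    // corresponding position in the full, flattened 4D array
    size_t dst = (((size_t)(ch + ch_beg)*nz + z + z_beg)*ny + y + y_beg)*nx + x + x_beg;
    d_4dArray[dst] = d_temp[i];
}

// possible launch, e.g. right after the cudaMemcpy of d_temp:
// size_t total = (size_t)idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
// scatter4d<<<(total + 255)/256, 256>>>(d_temp, d_4dArray, nx, ny, nz,
//     idx_x_beg, idx_y_beg, idx_z_beg, idx_ch_beg,
//     idx_x_size, idx_y_size, idx_z_size, idx_ch_size);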

Upvotes: 2
