Reputation: 153
I have a 4-D array stored flattened in a host array.
I want to copy part of it (the red region in the image below) to the device.
I don't know how to copy a region that is not contiguous in memory.
I am copying only part of the array because the original is over 10 GB and I need only about 10% of it.
At first I tried a for loop, but it took too much time.
Is there a better approach?
#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    // Sub-region bounds: beg is inclusive, end is exclusive.
    int idx_x_beg = 50,   idx_x_end = 100;
    int idx_y_beg = 100,  idx_y_end = 200;
    int idx_z_beg = 150,  idx_z_end = 300;
    int idx_ch_beg = 200, idx_ch_end = 400;
    // Sub-region sizes, kept in size_t so the index arithmetic below cannot overflow int.
    size_t idx_x_size = idx_x_end - idx_x_beg;
    size_t idx_y_size = idx_y_end - idx_y_beg;
    size_t idx_z_size = idx_z_end - idx_z_beg;
    size_t idx_ch_size = idx_ch_end - idx_ch_beg;
    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    // One cudaMemcpy per contiguous x-row of the sub-region.
    for (size_t temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (size_t temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (size_t temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + temp_ch*idx_z_size*idx_y_size*idx_x_size + temp_z*idx_y_size*idx_x_size + temp_y*idx_x_size
                         , h_4dArray + (idx_ch_beg + temp_ch)*nz*ny*nx + (idx_z_beg + temp_z)*ny*nx + (idx_y_beg + temp_y)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }
    return 0;
}
Upvotes: 1
Views: 469
Reputation: 152164
To copy a subset of an array, CUDA provides cudaMemcpy2D (which can copy a single 2D section of a multidimensional array) and cudaMemcpy3D (which can copy a single 3D section of a multidimensional array). You can find many questions under the cuda tag here that show how to use them.
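As a rough illustration of the cudaMemcpy3D route, the sketch below copies one 3D section per call and loops over the fourth (channel) dimension. The helper name copy_box_per_channel and all of its parameter names are made up for this example and do not come from the original post. It assumes the x-fastest layout used in the question and a packed destination buffer on the device, and it treats the box sizes sx, sy, sz, sc as element counts.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the original post): copies the box
// [x0, x0+sx) x [y0, y0+sy) x [z0, z0+sz) x [c0, c0+sc) of a flattened host
// array (x fastest, then y, z, channel) into a packed device buffer d_sub
// holding sx*sy*sz*sc doubles, using one cudaMemcpy3D call per channel.
cudaError_t copy_box_per_channel(const double *h_src, double *d_sub,
                                 size_t nx, size_t ny, size_t nz,
                                 size_t x0, size_t y0, size_t z0, size_t c0,
                                 size_t sx, size_t sy, size_t sz, size_t sc)
{
    for (size_t c = 0; c < sc; ++c) {
        cudaMemcpy3DParms p = {};  // the API expects unused fields to be zero
        // Source: one full channel of the host array, rows of nx doubles.
        p.srcPtr = make_cudaPitchedPtr((void *)(h_src + (c0 + c) * nz * ny * nx),
                                       nx * sizeof(double), nx, ny);
        // For plain (non-cudaArray) memory the x position is given in bytes.
        p.srcPos = make_cudaPos(x0 * sizeof(double), y0, z0);
        // Destination: one packed 3D block inside the device buffer.
        p.dstPtr = make_cudaPitchedPtr((void *)(d_sub + c * sz * sy * sx),
                                       sx * sizeof(double), sx, sy);
        p.dstPos = make_cudaPos(0, 0, 0);
        // The extent width is likewise in bytes when no cudaArray is involved.
        p.extent = make_cudaExtent(sx * sizeof(double), sy, sz);
        p.kind = cudaMemcpyHostToDevice;
        cudaError_t err = cudaMemcpy3D(&p);
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

With the sizes from the question this replaces the three nested loops with a single loop over the channels, at the cost of issuing strided copies.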
There are two problems with cudaMemcpy2D and cudaMemcpy3D here. First, neither can copy a 4D section in a single call, so you would still need a loop over at least one dimension. Second, these strided copies generally achieve lower host-to-device throughput than a cudaMemcpy operation that is copying the same number of bytes in aggregate.
So there is no free lunch here. I believe the best suggestion is to create an extra "contiguous" buffer on the host, do all your slice-copying to that buffer, then copy that buffer from host to device in a single cudaMemcpy call. After that, if you still need the 4D representation on the device, then you will need to write a device kernel that "scatters" the data for you. Conceptually, that is the reverse of the code you have shown.
Sorry, I'm not going to write all that code for you. However, I will rough out the first portion of it (getting everything copied to a single contiguous buffer on the device), using the code you have shown:
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(){
    int nx = 100, ny = 200, nz = 300, nch = 400;
    // Sub-region bounds: beg is inclusive, end is exclusive.
    int idx_x_beg = 50,   idx_x_end = 100;
    int idx_y_beg = 100,  idx_y_end = 200;
    int idx_z_beg = 150,  idx_z_end = 300;
    int idx_ch_beg = 200, idx_ch_end = 400;
    // Sub-region sizes, kept in size_t so the index arithmetic below cannot overflow int.
    size_t idx_x_size = idx_x_end - idx_x_beg;
    size_t idx_y_size = idx_y_end - idx_y_beg;
    size_t idx_z_size = idx_z_end - idx_z_beg;
    size_t idx_ch_size = idx_ch_end - idx_ch_beg;
    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray, *h_temp, *d_temp;
    // Number of elements in the sub-region.
    size_t temp_sz = idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
    h_temp = (double *)malloc(temp_sz*sizeof(double));
    cudaMalloc((void**)&d_temp, temp_sz*sizeof(double));
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    size_t size_tr = 0;  // elements packed into h_temp so far
    // Pack each contiguous x-row of the sub-region into h_temp on the host.
    for (size_t temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (size_t temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (size_t temp_y = 0; temp_y < idx_y_size; temp_y++) {
                memcpy(h_temp + size_tr
                     , h_4dArray + (idx_ch_beg + temp_ch)*nz*ny*nx + (idx_z_beg + temp_z)*ny*nx + (idx_y_beg + temp_y)*nx + idx_x_beg
                     , sizeof(double)*idx_x_size);
                size_tr += idx_x_size;
            }
        }
    }
    // A single host-to-device transfer of the packed buffer.
    cudaMemcpy(d_temp, h_temp, temp_sz*sizeof(double), cudaMemcpyHostToDevice);
    // if necessary, put cuda kernel here to scatter data from d_temp to d_4dArray
    return 0;
}
After that, as indicated, if you need the 4D representation on the device, you will need a CUDA kernel to scatter the data for you.
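For what it's worth, one possible shape for such a scatter kernel is sketched below. It is only an illustration: the names scatter_box and launch_scatter, the parameter names, and the launch configuration are all assumptions made for this example and are not part of the original post. It assumes the packed buffer stores the sub-region x fastest, then y, z, and channel, which matches the order in which the host loop above fills it, and it takes the sub-region origin (x0, y0, z0, c0) and sizes (sx, sy, sz, sc) in elements.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: element i of the packed buffer is written to its
// position in the full flattened 4D array (x fastest, then y, z, channel).
__global__ void scatter_box(const double *packed, double *full,
                            size_t nx, size_t ny, size_t nz,
                            size_t x0, size_t y0, size_t z0, size_t c0,
                            size_t sx, size_t sy, size_t sz, size_t sc)
{
    const size_t total = sx * sy * sz * sc;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    // Grid-stride loop, so any grid size covers every packed element.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += stride) {
        // Decompose the packed index into sub-region coordinates.
        size_t x = i % sx;
        size_t y = (i / sx) % sy;
        size_t z = (i / (sx * sy)) % sz;
        size_t c = i / (sx * sy * sz);
        // Shift by the sub-region origin and recompose with the full sizes.
        size_t dst = (((c0 + c) * nz + (z0 + z)) * ny + (y0 + y)) * nx + (x0 + x);
        full[dst] = packed[i];
    }
}

// Hypothetical host wrapper: picks a launch configuration and returns any
// launch error. It does not synchronize with the device.
cudaError_t launch_scatter(const double *d_packed, double *d_full,
                           size_t nx, size_t ny, size_t nz,
                           size_t x0, size_t y0, size_t z0, size_t c0,
                           size_t sx, size_t sy, size_t sz, size_t sc)
{
    const size_t total = sx * sy * sz * sc;
    if (total == 0) return cudaSuccess;
    const int threads = 256;
    size_t blocks = (total + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;  // cap; the grid-stride loop covers the rest
    scatter_box<<<(unsigned int)blocks, threads>>>(d_packed, d_full, nx, ny, nz,
                                                   x0, y0, z0, c0, sx, sy, sz, sc);
    return cudaGetLastError();
}
```

In the code above this would be called as launch_scatter(d_temp, d_4dArray, ...) after the cudaMemcpy. Elements of d_4dArray outside the sub-region are left untouched, so they hold whatever the allocation contained unless you initialize them yourself.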
Upvotes: 2