Reputation:
I have some trouble with allocate array of arrays in CUDA.
void ** data;
cudaMalloc(&data, sizeof(void**)*N); // allocates without problems
for(int i = 0; i < N; i++) {
cudaMalloc(data + i, getSize(i) * sizeof(void*)); // seg fault is thrown
}
What did I wrong?
Upvotes: 5
Views: 14672
Reputation: 48330
You have to allocate the pointers to a host memory, then allocate device memory for each array and store it's pointer in the host memory. Then allocate the memory for storing the pointers into the device and then copy the host memory to the device memory. One example is worth 1000 words:
__global__ void multi_array_kernel( int N, void** arrays ){
// stuff
}
int main(){
const int N_ARRAYS = 20;
void *h_array = malloc(sizeof(void*) * N_ARRAYS);
for(int i = 0; i < N_ARRAYS; i++){
cudaMalloc(&h_array[i], i * sizeof(void*));
//TODO: check error
}
void *d_array = cudaMalloc(sizeof(void*) * N_ARRAYS);
// Copy to device Memory
cudaMemcpy(d_array, h_array, sizeof(void*) * N_ARRAYS, cudaMemcpyHostToDevice);
multi_array_kernel<1,1>(N_ARRAYS, d_array);
cudaThreadSynchronize();
for(int i = 0; i < N_ARRAYS; i++){
cudaFree(h_array[i]); //host not device memory
//TODO: check error
}
cudaFree(d_array);
free(h_array);
}
Upvotes: 11
Reputation: 121
I had the same Problem and managed to solve it.
FabrizioM's answer was a good point to start for me and helped me a lot. But nevertheless i encountered some problems when i tried to transfer the code to my project. Using the additional comments and posts i was able to write a working example (VS2012, CUDA7.5). Thus i will post my code as additional answer and as point to start for others.
To understand the naming: I'm using a vector of OpenCV cv::Mat as input which are captured from multiple cameras and i am processing these images in the Kernel.
void TransferCameraImageToCuda(const std::vector<cv::Mat*>* Images)
{
int NumberCams = Images->size();
int imageSize = Images->at(0)->cols*Images->at(0)->rows;
CUdeviceptr* CamArraysAdressOnDevice_H;
CUdeviceptr* CamArraysAdressOnDevice_D;
//allocate memory on host to store the device-address of each array
CamArraysAdressOnDevice_H = new CUdeviceptr[NumberCams];
// allocate memory on the device and store the arrays on the device
for (int i = 0; i < NumberCams; i++){
cudaMalloc((void**)&(CamArraysAdressOnDevice_H[i]), imageSize * sizeof(unsigned short));
cudaMemcpy((void*)CamArraysAdressOnDevice_H[i], Images->at(i)->data, imageSize * sizeof(unsigned short), cudaMemcpyHostToDevice);
}
// allocate memory on the device to store the device-adresses of the arrays
cudaMalloc((void**)&CamArraysAdressOnDevice_D, sizeof(CUdeviceptr*)* NumberCams);
// Copy the adress of each device array to the device
cudaMemcpy(CamArraysAdressOnDevice_D, CamArraysAdressOnDevice_H, sizeof(CUdeviceptr*)* NumberCams, cudaMemcpyHostToDevice);
}
In the kernel launch I'm casting the device pointer to the data type pointer (unsigned short**)
DummyKernel<<<gridDim,blockDim>>>(NumberCams, (unsigned short**) CamArraysAdressOnDevice_D)
and the kernel definition is for example:
__global__ void DummyKernel(int NumberImages, unsigned short** CamImages)
{
int someIndex = 3458;
printf("Value Image 0 : %d \n", CamImages[0][someIndex]);
printf("Value Image 1 : %d \n", CamImages[1][someIndex]);
printf("Value Image 2 : %d \n", CamImages[2][someIndex]);
}
Upvotes: 2
Reputation: 21
you cannot use
cudaMalloc(&h_array[i], i * sizeof(void*));
for array declared as void *
use defined data type
CUdeviceptr *h_array = malloc(sizeof(CUdeviceptr *) * N);
or
int *h_array = malloc(sizeof(int *) * N);
and cast it to void *
cudaMalloc((void *)&h_array[i], i * sizeof(void*));
Upvotes: 2
Reputation: 21
I think in the first loop it should be &h_array[i]
not &d_array[i]
.
Upvotes: 2
Reputation: 1443
I don't believe this is supported. cudaMalloc()
allocates device memory, but stores the address in a variable on the host. In your for-loop, you are passing it addresses in device memory.
Depending on what you're trying to accomplish, you may want to allocate data
with normal host malloc()
before calling the for-loop as you currently have it. Or allocate a single big block of device memory and compute offsets into it manually.
Look at Sections 2.4, 3.2.1 and B.2.5 (bottom) of the CUDA Programming Guide for more discussion of this. Specifically, on the bottom of page 108:
The address obtained by taking the address of a
__device__
,__shared__
or__constant__
variable can only be used in device code.
Upvotes: 4