Reputation: 157
I have these structs:
typedef struct neuron
{
    float* weights;
    int n_weights;
} Neuron;

typedef struct neurallayer
{
    Neuron *neurons;
    int n_neurons;
    int act_function;
} NLayer;
"NLayer" struct can contain an arbitrary number of "Neuron"
I've tried to allocate a 'NLayer' struct with 5 'Neurons' from the host in this way:
NLayer* nL;
int i;
int tmp = 9;

cudaMalloc((void**)&nL, sizeof(NLayer));
cudaMalloc((void**)&nL->neurons, 6*sizeof(Neuron));

for (i = 0; i < 5; i++)
    cudaMemcpy(&nL->neurons[i].n_weights, &tmp, sizeof(int), cudaMemcpyHostToDevice);
...then I've tried to modify the "nL->neurons[0].n_weights" variable with this kernel:
__global__ void test(NLayer* n)
{
    n->neurons[0].n_weights = 121;
}
but at compile time nvcc returns this warning, related to the only line of the kernel:
Warning: Cannot tell what pointer points to, assuming global memory space
and when the kernel finishes its work the struct becomes unreachable.
It's very probable that I'm doing something wrong during the allocation... can someone help me? Thanks very much, and sorry for my English! :)
UPDATE:
Thanks to aland, I've modified my code, creating this function that should allocate an instance of the struct "NLayer":
NLayer* setNLayer(int numNeurons, int weightsPerNeuron, int act_fun)
{
    int i;
    NLayer h_layer;
    NLayer* d_layer;
    float* d_weights;

    // SET THE LAYER VARIABLES OF THE HOST NLAYER
    h_layer.act_function = act_fun;
    h_layer.n_neurons = numNeurons;

    // ALLOCATE THE DEVICE NLAYER
    if (cudaMalloc((void**)&d_layer, sizeof(NLayer)) != cudaSuccess)
        puts("ERROR: Unable to allocate the Layer");

    // ALLOCATE THE NEURONS ON THE DEVICE
    if (cudaMalloc((void**)&h_layer.neurons, numNeurons*sizeof(Neuron)) != cudaSuccess)
        puts("ERROR: Unable to allocate the Neurons of the Layer");

    // COPY THE HOST NLAYER ONTO THE DEVICE
    if (cudaMemcpy(d_layer, &h_layer, sizeof(NLayer), cudaMemcpyHostToDevice) != cudaSuccess)
        puts("ERROR: Unable to copy the data layer onto the device");

    for (i = 0; i < numNeurons; i++)
    {
        // ALLOCATE THE WEIGHTS ARRAY ON THE DEVICE
        cudaMalloc((void**)&d_weights, weightsPerNeuron*sizeof(float));
        // COPY ITS POINTER AS PART OF THE i-TH NEURON STRUCT
        if (cudaMemcpy(&d_layer->neurons[i].weights, &d_weights, sizeof(float*), cudaMemcpyHostToDevice) != cudaSuccess)
            puts("Error: unable to copy weights' pointer to the device");
    }

    // RETURN THE DEVICE POINTER
    return d_layer;
}
and I call that function from main in this way (the kernel "test" is declared earlier):
int main()
{
    NLayer* nL;
    int h_tmp1;
    float h_tmp2;

    nL = setNLayer(10, 12, 13);
    test<<<1,1>>>(nL);

    if (cudaMemcpy(&h_tmp1, &nL->neurons[0].n_weights, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
        puts("ERROR!!");
    printf("RESULT:%d", h_tmp1);
}
When I compile that code the compiler shows me the warning, and when I execute the program it prints:
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
ERROR!!
RESULT:1
The last error doesn't appear if I comment out the kernel call.
Where am I wrong? I don't know what to do. Thanks for your help!
Upvotes: 5
Views: 4342
Reputation: 5209
The problem is here:
cudaMalloc((void**)&nL,sizeof(NLayer));
cudaMalloc((void**)&nL->neurons,6*sizeof(Neuron));
In the first line, nL points to a structure in global memory on the device.
Therefore, in the second line the first argument to cudaMalloc is an address residing on the GPU, which is undefined behaviour (on my test system it causes a segfault; in your case, though, there is something more subtle).
The correct way to do what you want is to first create the structure in host memory, fill it with data, and then copy it to the device, like this:
NLayer* nL;
NLayer h_nL;
int i;
int tmp = 9;

// Allocate data on device
cudaMalloc((void**)&nL, sizeof(NLayer));
cudaMalloc((void**)&h_nL.neurons, 6*sizeof(Neuron));

// Copy nlayer with pointers to device
cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice);
Also, don't forget to always check for any errors from CUDA routines.
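For example, a small wrapper macro keeps this from cluttering the code; the CUDA_CHECK name here is just an illustration, not part of the CUDA API, and it needs <stdio.h> for fprintf:

#include <stdio.h>

// Report any CUDA runtime error together with the file and line it came from
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
    } while (0)

// Usage:
CUDA_CHECK(cudaMalloc((void**)&nL, sizeof(NLayer)));
CUDA_CHECK(cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice));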
UPDATE
In the second version of your code:
cudaMemcpy(&d_layer->neurons[i].weights, &d_weights, ...)
--- again, you are dereferencing a device pointer (d_layer) on the host. Instead, you should use
cudaMemcpy(&h_layer.neurons[i].weights, &d_weights, sizeof(float*), cudaMemcpyHostToDevice);
Here you take h_layer (the host structure) and read its element (h_layer.neurons), which is a pointer to device memory. Then you do some pointer arithmetic on it (&h_layer.neurons[i].weights). No access to device memory is needed to compute this address.
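For reference, here is a minimal sketch of setNLayer with that one line corrected (error checks trimmed for brevity; the copy of weightsPerNeuron into n_weights at the end of the loop is an addition of this sketch, since the original function leaves that field unset):

NLayer* setNLayer(int numNeurons, int weightsPerNeuron, int act_fun)
{
    int i;
    NLayer h_layer;
    NLayer* d_layer;
    float* d_weights;

    // Fill the host copy of the layer
    h_layer.act_function = act_fun;
    h_layer.n_neurons = numNeurons;

    // Allocate the device layer and the device array of neurons
    cudaMalloc((void**)&d_layer, sizeof(NLayer));
    cudaMalloc((void**)&h_layer.neurons, numNeurons*sizeof(Neuron));

    // Copy the host layer (which now holds the device neurons pointer) to the device
    cudaMemcpy(d_layer, &h_layer, sizeof(NLayer), cudaMemcpyHostToDevice);

    for (i = 0; i < numNeurons; i++)
    {
        // Allocate the weights array of the i-th neuron on the device
        cudaMalloc((void**)&d_weights, weightsPerNeuron*sizeof(float));
        // h_layer.neurons is a device pointer, but the address
        // &h_layer.neurons[i].weights is computed on the host, so the
        // destination of this copy is a valid device address
        cudaMemcpy(&h_layer.neurons[i].weights, &d_weights,
                   sizeof(float*), cudaMemcpyHostToDevice);
        // Not in the original function: also store the number of weights
        cudaMemcpy(&h_layer.neurons[i].n_weights, &weightsPerNeuron,
                   sizeof(int), cudaMemcpyHostToDevice);
    }

    return d_layer;
}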
Upvotes: 6
Reputation: 491
It all depends on the GPU card you're using. Fermi cards use uniform addressing of the shared and global memory spaces, while pre-Fermi cards don't.
For the pre-Fermi case, you don't know whether an address should be shared or global. The compiler can usually figure this out, but there are cases where it can't. When a pointer to shared memory is required, you usually take the address of a shared variable and the compiler can recognise this. The message "assuming global" will appear when this is not explicitly defined.
If you are using a GPU with compute capability 2.x or higher, it should work with the -arch=sm_20 compiler flag.
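For example, assuming the source file is named test.cu:

nvcc -arch=sm_20 test.cu -o test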
Upvotes: 0