Reputation: 4718
I have a rather complicated C++ class which has functions of the sort myObj.fun(x,y,z). I would like to call this on a GPU over a 3D grid of points.
My high-level question is: is it a simple matter to pass myObj and a large number of points to the GPU? I had been avoiding creating a CUDA implementation of this function because of the work involved, but it occurs to me this may be very simple.
Again, this is a very high-level question, so "yes, it's that easy" or "no, it's more complicated" are welcome, though a bit of direction would also be helpful.
Upvotes: 0
Views: 1212
Reputation: 151799
It's possible to use classes/objects on the GPU, including their methods (e.g. fun()). At a minimum, such classes would need to have their methods decorated with __host__ __device__, but the code refactoring may not be any more involved than that.
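As a rough sketch (the class and member names here are invented for illustration, not taken from your code), a decorated method might look like this:

// hypothetical class: fun() is compiled for both the host and the device
class myObj_t {
  double scale;
public:
  myObj_t(double s) : scale(s) {}    // constructed on the host
  __host__ __device__ double fun(double x, double y, double z) const {
    return scale * (x + y + z);      // identical code runs on CPU or GPU
  }
};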
However, such methods (like any other code with no refactoring) will probably not tap any of the parallel power of the GPU. The most basic comparison is that methods which ran in a single CPU thread would then run in a single GPU thread. This is normally no faster, and frequently a lot slower, if all you did was pass a single object to the GPU and run the equivalent single-threaded code there (in a single GPU thread).
One possible strategy is that if you have a great many of these objects (or, equivalently in your case, a great many "points") that represent independent work to be done on each, you could pass each one (object, or point) to a GPU thread and work on them that way, enabling the massively multithreaded operation that GPUs like. Ideally you would have 10,000 or more points to process this way.
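As a sketch of that one-point-per-thread mapping (reusing the hypothetical myObj_t above, with grid dimensions NX/NY/NZ as assumed parameters), a kernel for a 3D grid of points can use CUDA's built-in 3D thread indexing:

// hypothetical kernel: one thread per point of an NX x NY x NZ grid
__global__ void apply_fun(myObj_t obj, double *out, int NX, int NY, int NZ) {
  int ix = threadIdx.x + blockDim.x * blockIdx.x;
  int iy = threadIdx.y + blockDim.y * blockIdx.y;
  int iz = threadIdx.z + blockDim.z * blockIdx.z;
  if (ix < NX && iy < NY && iz < NZ) {
    int idx = ix + NX * (iy + NY * iz);  // flatten 3D coordinate to linear index
    out[idx] = obj.fun(ix, iy, iz);      // each thread handles exactly one point
  }
}

A matching launch would use something like dim3 block(8,8,8); dim3 grid((NX+7)/8, (NY+7)/8, (NZ+7)/8); so that every point gets its own thread.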
This still usually isn't the most efficient use of the GPU, for a variety of reasons: one has to do with efficient data access, another with (possible) thread divergence. Nevertheless, some people do pursue this "simple", "embarrassingly parallel" approach to code porting, occasionally with interesting speed-ups.
Depending on your actual code, you might see attractive results if you pass the points to the GPU in a fashion that allows adjacent threads to access adjacent data, for each operation that accesses the points. It's quite reasonable to expect an attractive speed-up in that fashion, perhaps with relatively little code refactoring beyond attention to data organization for optimal GPU access.
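In practice, "adjacent threads access adjacent data" usually means a struct-of-arrays layout rather than an array of point structs. A sketch of the two layouts (the type names are illustrative):

// array-of-structs: thread i reads pts[i].x, pts[i].y, pts[i].z;
// neighboring threads touch memory locations a full struct apart (poor coalescing)
struct point_aos { double x, y, z; };

// struct-of-arrays: thread i reads xs[i], ys[i], zs[i];
// neighboring threads touch adjacent words in each array (coalesces well)
struct points_soa { double *xs, *ys, *zs; };

Note that the worked example below already follows the struct-of-arrays pattern: the x, y, and z values live in three separate arrays.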
Here's a fully worked example:
$ cat t30.cu
#include <iostream>
#include <cstdlib>
const int dsize = 3;
const int nTPB = 256;
const int rng = 8;
class myclass {
  int increment;
public:
  myclass(int _incr) : increment(_incr) {}
  // methods callable on the device need the __device__ decoration
  __host__ __device__ void fun(int &x, int &y, int &z) {
    x += increment;
    y += increment;
    z += increment;
  }
};

// this is the actual device routine that is run per thread
__global__ void mykernel(myclass obj, int *dx, int *dy, int *dz, int dsize) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x; // figure out which thread we are
  if (idx < dsize)
    obj.fun(dx[idx], dy[idx], dz[idx]); // apply method
}

int main() {
  // allocate host data
  int *p_x, *p_y, *p_z, *d_x, *d_y, *d_z;
  p_x = new int[dsize];
  p_y = new int[dsize];
  p_z = new int[dsize];
  // allocate device data
  cudaMalloc(&d_x, dsize*sizeof(int));
  cudaMalloc(&d_y, dsize*sizeof(int));
  cudaMalloc(&d_z, dsize*sizeof(int));
  // initialize host data
  std::cout << "Before:" << std::endl;
  for (int i = 0; i < dsize; i++) {
    p_x[i] = rand()%rng;
    p_y[i] = rand()%rng;
    p_z[i] = rand()%rng;
    std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;
  }
  // copy to device
  cudaMemcpy(d_x, p_x, dsize*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, p_y, dsize*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_z, p_z, dsize*sizeof(int), cudaMemcpyHostToDevice);
  // instantiate object on host
  myclass test(1);
  // copy object to device as kernel parameter
  mykernel<<<(dsize+nTPB-1)/nTPB, nTPB>>>(test, d_x, d_y, d_z, dsize);
  // copy data back to host
  cudaMemcpy(p_x, d_x, dsize*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(p_y, d_y, dsize*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(p_z, d_z, dsize*sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << "After:" << std::endl;
  for (int i = 0; i < dsize; i++) {
    std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;
  }
  return 0;
}
$ nvcc -o t30 t30.cu
$ ./t30
Before:
7,6,1
3,1,7
2,4,1
After:
8,7,2
4,2,8
3,5,2
$
For brevity of presentation, I've omitted proper CUDA error checking, but I would always recommend you use it when developing CUDA code.
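One common way to add that checking is a macro used after each API call or kernel launch. This is just one widely used variant, not part of the CUDA API itself (it needs <cstdio> and <cstdlib>):

#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

For example, cudaCheckErrors("cudaMemcpy failure"); after each cudaMemcpy, and after a kernel launch typically a cudaDeviceSynchronize() followed by the same check, so asynchronous errors are caught too.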
Upvotes: 4