Reputation: 26067

CUDA development on different cards?

I'm just starting to learn how to do CUDA development(using version 4) and was wondering if it was possible to develop on a different card then I plan to use? As I learn, it would be nice to know this so I can keep an eye out if differences are going to impact me.

I have a mid-2010 macbook pro with a Nvidia GeForce 320M graphic cards(its a pretty basic laptop integrated card) but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if its possible to develop locally on my laptop and then run it on EC2 with minimal changes(I'm doing this for a personal project and don't want to spend $2.4 for development).

A specific question is, I heard that recursions are supported in newer cards(and maybe not in my laptops), what if I run a recursion on my laptop gpu? will it kick out an error or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is kind of the what I'm getting at).

If this is going to be a problem, is there emulators for features not avail in my current card? or will the SDK emulate it for me?

Sorry if this question is too basic.

Upvotes: 0

Answers (3)

Tim Child

Reputation: 3012

The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.

Upvotes: 0

djmj

Reputation: 5554

The Cuda Version at first is not that important. More important are the compute capabilities of your card.

If you programm your kernels using cc 1.0 and they are scalable for the future you won't have any problems.

Choose yourself your minimum cc level you need for your application.

Calculate necessary parameters using properties and use ptx jit compilation:

If your kernel can handle arbitrary input sized data and your kernel launch configuration scales across thousands of threads it will scale across future versions.

In my projects all my kernels used a fixed number of threads per block which was equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor to reach 100% occupancy. Some kernels need a multiple of two number of threads per block so I handled this case also since not for all cc versions the above equation guaranteed a multiple of two block size.

Some kernels used shared memory and its size was also deducted by the cc level properties.

This data was received using (cudaGetDeviceProperties) in a utility class and using ptx jit compiling my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on latest cuda cards without any changes!

All kernels were programmed to work with 64-bit length input data and utilizing all dimensions of the 3D Grid. (I am pretty sure in a year I will continue working on this project so this was necessary)

All my kernels except one did not exceeded the cc 1.0 register limit while having 100% occ. So if the used card cc was below 1.2 I added a maxregcount command to my kernel to still enforce 100% occ.

This does not guarantees best possible performance!

For possible best performance each kernel should be analyzed regarding its parameters and resources.

This maybe is not practicable for all applications and requirements

Upvotes: 2

aland

Reputation: 5209

Yes, it's a pretty common practice to use different GPUs for development and production. nVidia GPU generations are backward-compatible, so if your program runs on older card (that is if 320M (CC1.3)), it would certainly run on M2070 (CC2.0)).

If you want to get maximum performance, you should, however, profile your program on same architecture you are going to use it, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provide much worse view of what's going on than running on no-matter-how-old GPU.

Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:

nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)

In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.

However, I would strongly recommend you to avoid recursion at any cost: it goes against GPGPU programming paradigm, and would certainly lead to very poor performance. Most algorithms are easily rewritten without the use of recursion, and it is much more preferable way to utilize them, not only on GPU, but on CPU as well.

Upvotes: 3

CUDA development on different cards?

Answers (3)

Related Questions