sm-level : 1.3 vs 2.0 performance

Question

My code doesn't depend on the sm level; I could even build it for sm10 if I wanted. But when I built it for sm13 instead of sm20, as I had been doing before, I got x1.25 performance with no code changes! sm20 -> 35ms, sm13 -> 25ms

After those gorgeous results, I tried checking/unchecking every option in project settings->CUDA settings->all :) I think I found the setting that produces that awesome speed:

sm13 without fast math generation (fm from here on) = 25ms
sm13 with fm = 25ms
sm20 without fm = 35ms
sm20 with fm = 25ms (the same result as sm13)

Why is this so? Maybe sm13 forces the use of hardware math, but sm20 doesn't? Or is it just a coincidence, and later sm levels simply perform worse than earlier ones on the same program?

Tom · Accepted Answer

In addition to compiling in release mode, as pointed out by @Robert Crovella, you should also consider that when you target sm_13 the compiler is able to simplify some of the floating-point math. sm_20 and later support precise division, precise square root, and denormals by default.

You can try disabling these features with the nvcc command-line options -ftz=true -prec-div=false -prec-sqrt=false. See the CUDA C Best Practices Guide for more information.
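
To see the effect in isolation, here is a minimal sketch of a micro-benchmark. It is not the asker's code: the file name, kernel name, loop count, and input values are all made up for illustration. Note that nvcc's -use_fast_math option implies -ftz=true -prec-div=false -prec-sqrt=false (among other things), which would explain why sm20 with fast math matches the sm13 timing. Recent toolkits no longer accept sm_13 or sm_20, so substitute an architecture your toolkit supports.

```cuda
// precise_vs_relaxed.cu (hypothetical file name)
// Build the same source three ways and compare the printed time:
//   nvcc -O3 -arch=sm_20 precise_vs_relaxed.cu -o precise
//   nvcc -O3 -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false precise_vs_relaxed.cu -o relaxed
//   nvcc -O3 -arch=sm_20 -use_fast_math precise_vs_relaxed.cu -o fast
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Repeated single-precision sqrt and division: the operations whose
// code generation is controlled by -prec-sqrt and -prec-div.
__global__ void divSqrt(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];
        float d = b[i];
        for (int k = 0; k < 64; ++k) {   // arbitrary count, keeps the kernel compute-heavy
            x = sqrtf(x + 1.0f) / d;
        }
        out[i] = x;
    }
}

int main()
{
    const int n = 1 << 22;               // illustrative problem size
    const size_t bytes = n * sizeof(float);

    std::vector<float> hostA(n, 2.0f), hostB(n, 1.5f);
    float *a = nullptr, *b = nullptr, *out = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(a, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hostB.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization cost is not timed.
    divSqrt<<<blocks, threads>>>(a, b, out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    divSqrt<<<blocks, threads>>>(a, b, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```

How large the gap is depends on how much of the kernel's time goes into division and square root; a memory-bound kernel may show little or no difference. The relaxed builds also give slightly less accurate results, so check that your outputs are still acceptable before keeping the flags.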
