Reputation: 1892
My code doesn't depend on the SM level; I could build it for sm_10 if I wanted. But when I built it for sm_13 instead of sm_20, as I had been doing before, I got a x1.25 speedup with no code changes! sm_20 -> 35 ms, sm_13 -> 25 ms
After those gorgeous results, I tried checking/unchecking every option in Project Settings -> CUDA settings -> All :) I think I found the setting responsible for that awesome speedup:
Why is this so? Maybe sm_13 forces the use of hardware math and sm_20 does not? Or is it only a coincidence, and later SM levels simply run slower than earlier ones on the same program?
Upvotes: 1
Views: 211
Reputation: 21128
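In addition to compiling in release mode, as pointed out by @Robert Crovella, you should also consider that when you target sm_13 the compiler is able to simplify some of the floating point maths. sm_20 and later support precise (IEEE-compliant) single-precision division, precise square root, and denormals by default, and these cost extra instructions; sm_1x code uses faster approximate division and square root and flushes denormals to zero.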
You can try disabling these features with the command line options -ftz=true -prec-div=false -prec-sqrt=false. See the best practices guide for more information.
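To check whether these flags explain the difference in your case, you could time a small kernel that is dominated by division and square root and build it both ways. The following is only a rough sketch: the kernel name, file name, and problem size are made up for illustration and are not from your project, and the -arch value should be whatever your toolkit and GPU support (recent toolkits no longer accept sm_20).

```cuda
// bench.cu (hypothetical file name)
// Build with the precise defaults:
//   nvcc -arch=sm_20 -o precise bench.cu
// Build with the sm_1x-like fast math settings:
//   nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -o fast bench.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: one single-precision division and one square root
// per element, the operations affected by -prec-div and -prec-sqrt.
__global__ void normalize(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = x[i] / sqrtf(x[i] * x[i] + y[i] * y[i]);
    }
}

int main()
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // Non-zero inputs so the division is well defined.
    std::vector<float> hx(n), hy(n, 2.0f);
    for (int i = 0; i < n; ++i) hx[i] = 1.0f + (i % 7);

    float *x, *y, *out;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup cost is not included in the timing.
    normalize<<<blocks, threads>>>(x, y, out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    normalize<<<blocks, threads>>>(x, y, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    cudaFree(out);
    return 0;
}
```

If the two builds show a gap similar to the one you measured between sm_13 and sm_20, the precise math defaults are the likely cause. Note that -use_fast_math implies the same three settings (and additionally swaps some math functions for faster intrinsics), so expect results to differ slightly in the last bits.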
Upvotes: 2