Dani Grosu

Reputation: 554

CUDA speed optimization

I have developed a CUDA application for modular exponentiation, and it performs very well for 512-bit integers. These multi-precision integers are stored as sixteen 32-bit words.
I use several techniques to achieve a 2.5-3.2x speedup compared to the OpenSSL modular exponentiation approach.
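For reference, a minimal sketch of the storage layout described above (the type and field names here are illustrative, not the exact ones from my code):

    /* 512-bit multi-precision integer stored as sixteen 32-bit limbs,
       least-significant limb first. */
    #define LIMBS 16  /* 16 x 32 bits = 512 bits */

    typedef struct {
        unsigned int limb[LIMBS];
    } bigint512;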

All good so far, but when I extend the integers to 1024 bits, performance drops dramatically to 0.1-0.3x, and the only difference is the memory needed to store an integer: now thirty-two 32-bit words. Not to mention the 2048-bit version, which is hundreds of times slower.

I should mention that when I want to compute, for example, 1000 modular exponentiations (r = a^x mod n), I simply send all the operands to my kernel, which means 512,000 bytes of memory.
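Roughly, the device-side setup looks like this (a simplified sketch; the names are illustrative, and I assume each exponentiation carries the operands a, x, n plus a result slot r):

    #include <cuda_runtime.h>

    #define N_OPS 1000
    #define LIMBS 16   /* 16 limbs = 512 bits; 32 limbs for the 1024-bit version */

    int main(void) {
        size_t bytes = (size_t)N_OPS * LIMBS * sizeof(unsigned int);

        unsigned int *d_a, *d_x, *d_n, *d_r;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_n, bytes);
        cudaMalloc((void **)&d_r, bytes);
        /* ... cudaMemcpy the host operands in, launch the kernel, copy r back ... */
        return 0;
    }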
My question: why does this seemingly minor change affect performance so much?
I use an NVIDIA GeForce GT 520MX, Ubuntu 14.04 64-bit.

Upvotes: 0

Views: 270

Answers (1)

Taro

Reputation: 798

Hard to tell without a minimal test case, but you could be running into several limitations as you increase the size of your data:

  • Registers
  • Shared memory / L1 cache
  • Occupancy

And maybe a lot of others that I am forgetting. Register usage, at least, is easy to check, as sketched below.
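Here is a sketch of that quick check (myExpKernel is a placeholder for your actual kernel; you can get the same numbers at compile time with nvcc --ptxas-options=-v). A large localSizeBytes means registers are spilling into slow local memory, a common symptom when per-thread data grows:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void myExpKernel(unsigned int *r) { /* your kernel here */ }

    int main(void) {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myExpKernel);
        printf("registers per thread   : %d\n", attr.numRegs);
        printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
        printf("shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
        return 0;
    }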

Profiling your application could be very, very helpful. If you use Visual Studio, NVIDIA Nsight can analyse the execution of your application and give you a lot of helpful information:

  • Blocks, threads, warps
  • Theoretical and achieved device occupancy
  • Multiprocessor activity
  • etc.

It can even draw some charts so you can easily see where your bottleneck is.
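If you would rather check from code than from the GUI, and your toolkit is recent enough (CUDA 6.5+), the occupancy API gives you the theoretical figure directly (a sketch; the kernel and block size are placeholders for yours):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void myExpKernel(unsigned int *r) { /* your kernel here */ }

    int main(void) {
        int blockSize = 128;   /* your actual launch configuration */
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                      myExpKernel, blockSize, 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double occ = (double)(maxBlocksPerSM * blockSize)
                   / prop.maxThreadsPerMultiProcessor;
        printf("theoretical occupancy: %.0f%%\n", occ * 100.0);
        return 0;
    }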

See my answer here on how to set up Nsight and analyse the performance of your application.

Upvotes: 2
