Dani Grosu

Reputation: 554

CUDA speed optimization

I have developed a CUDA application for modular exponentiation, and it performs very well for 512-bit integers. These multi-precision integers are stored as sixteen 32-bit words.
I use several techniques to achieve a 2.5-3.2x speedup compared to the OpenSSL modular exponentiation approach.
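For reference, a minimal sketch of the storage layout described above (the type and field names here are illustrative, not the exact ones from my code):

    /* 512-bit multi-precision integer stored as sixteen 32-bit limbs,
       least-significant limb first. */
    #define LIMBS 16  /* 16 x 32 bits = 512 bits */

    typedef struct {
        unsigned int limb[LIMBS];
    } bigint512;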

All good so far, but when I extend the integers to 1024 bits, performance drops dramatically to 0.1-0.3x, and the only difference is the memory needed to store an integer: now thirty-two 32-bit words. Not to mention the 2048-bit version, which is hundreds of times slower.

I should mention that when I want to compute, for example, 1000 modular exponentiations (r = a^x mod n), I simply send all the operands to my kernel, which means 512,000 bytes of memory.
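Roughly, the device-side setup looks like this (a simplified sketch; the names are illustrative, and I assume each exponentiation carries the operands a, x, n plus a result slot r):

    #include <cuda_runtime.h>

    #define N_OPS 1000
    #define LIMBS 16   /* 16 limbs = 512 bits; 32 limbs for the 1024-bit version */

    int main(void) {
        size_t bytes = (size_t)N_OPS * LIMBS * sizeof(unsigned int);

        unsigned int *d_a, *d_x, *d_n, *d_r;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_n, bytes);
        cudaMalloc((void **)&d_r, bytes);
        /* ... cudaMemcpy the host operands in, launch the kernel, copy r back ... */
        return 0;
    }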
My question: why does this seemingly minor change affect performance so much?
I use an NVIDIA GeForce GT 520MX, Ubuntu 14.04 64-bit.

Upvotes: 0

Views: 270

Answers (1)

Taro

Reputation: 798

Hard to tell without a minimal test case, but you could be running into several limitations as you increase the size of your data:

  • Registers
  • Shared memory / L1 cache
  • Occupancy

And maybe a lot of others that I am forgetting. Register usage, at least, is easy to check, as sketched below.
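Here is a sketch of that quick check (myExpKernel is a placeholder for your actual kernel; you can get the same numbers at compile time with nvcc --ptxas-options=-v). A large localSizeBytes means registers are spilling into slow local memory, a common symptom when per-thread data grows:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void myExpKernel(unsigned int *r) { /* your kernel here */ }

    int main(void) {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myExpKernel);
        printf("registers per thread   : %d\n", attr.numRegs);
        printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
        printf("shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
        return 0;
    }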

Profiling your application could be very, very helpful. If you use Visual Studio, NVIDIA Nsight can analyse the execution of your application and give you a lot of helpful information:

  • Blocks, threads, warps
  • Theoretical and achieved device occupancy
  • Multiprocessor activity
  • etc.

It can even draw some charts so you can easily see where your bottleneck is.
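If you would rather check from code than from the GUI, and your toolkit is recent enough (CUDA 6.5+), the occupancy API gives you the theoretical figure directly (a sketch; the kernel and block size are placeholders for yours):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void myExpKernel(unsigned int *r) { /* your kernel here */ }

    int main(void) {
        int blockSize = 128;   /* your actual launch configuration */
        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                      myExpKernel, blockSize, 0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double occ = (double)(maxBlocksPerSM * blockSize)
                   / prop.maxThreadsPerMultiProcessor;
        printf("theoretical occupancy: %.0f%%\n", occ * 100.0);
        return 0;
    }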

See my answer here on how to set up Nsight and analyse the performance of your application.

Upvotes: 2
