Reputation: 193
I have a working conjugate gradient method implementation in PyCUDA that I want to optimize. It uses a self-written matrix-vector multiplication and the PyCUDA-native gpuarray.dot
and gpuarray.mul_add
functions.
Profiling the program with kernprof.py/line_profiler
shows that most of the time until convergence (>60%) is spent in a single gpuarray.dot()
call, which takes about 0.2 seconds.
All following calls of gpuarray.dot()
take about 7 microseconds, and all calls receive the same type of input vectors (size: 400 doubles).
Is there any reason why? In the end it is just a constant overhead, but it makes the profiling difficult.
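For reference, a stripped-down snippet (not my actual CG code, just the dot() call in isolation with made-up vectors of the same size) that shows the kind of timing I mean:

```python
import time
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.gpuarray as gpuarray

# Dummy vectors of the size used in my solver (400 doubles).
a = gpuarray.to_gpu(np.random.rand(400))
b = gpuarray.to_gpu(np.random.rand(400))

for i in range(3):
    t0 = time.time()
    gpuarray.dot(a, b).get()    # .get() forces synchronization with the GPU
    print("call %d: %.6f s" % (i, time.time() - t0))
# The first iteration is far slower than the following ones.
```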
I wanted to ask the question on the PyCUDA mailing list, but I wasn't able to subscribe with an @gmail.com address. If anyone has either an explanation for the strange .dot()
behavior or for my inability to subscribe to that mailing list, please give me a hint ;)
Upvotes: 1
Views: 792
Reputation: 48330
One reason would be that PyCUDA compiles the kernel before uploading it. As far as I remember, though, that should happen only the very first time it is executed.
One solution could be to "warm up" the kernel by executing it once and then start the profiling procedure.
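A minimal sketch of such a warm-up, assuming vectors of the size the question mentions (400 doubles):

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.random.rand(400))
b = gpuarray.to_gpu(np.random.rand(400))

# Warm-up call: absorbs any one-time kernel compilation/caching cost.
gpuarray.dot(a, b).get()

# ... start profiling here; subsequent dot() calls reuse the compiled kernel.
result = gpuarray.dot(a, b).get()
```

That way the one-time setup cost no longer shows up inside the profiled region, and the remaining timings reflect the steady-state cost of each call.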
Upvotes: 2