Reputation: 233
I profiled my MATLAB code to identify the most time-consuming functions; they are mostly the gradient and kron MATLAB functions in this file. I want to rewrite them as CUDA kernels, compile them to PTX, and call them from MATLAB. Any ideas or articles would be helpful. Also, the calculation of m and b seems to be separable, which makes them good candidates for assignment to different blocks. Here is a snippet of the code from the file:
i2w=g0*aff(i2,a0);
[ix,iy]=grad(i2w);
ix=ix.*region;iy=iy.*region;
ix2=ix.^2;iy2=iy.^2;ixiy=ix.*iy;
it=i1-i2w;
m1=sum(sum(kron(ones(1,limy)',(1-centx:limx-centx).^2).*ix2));
m2=sum(sum(kron((1-centy:limy-centy)',(1-centx:limx-centx)).*ix2));
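For reference, the two kron expressions only build coordinate-weight matrices, so the sums can be written as plain loops, which is the form that maps directly onto a CUDA kernel. Here is a sketch of my own (the name kron_sums is hypothetical, and it assumes ix2 is the limy-by-limx image of squared x-gradients stored row-major):

```c
#include <assert.h>

/* Hypothetical loop form of the kron-weighted sums m1 and m2.
   ix2 is limy-by-limx, row-major: ix2[y*limx + x].
   centx/centy are the centre offsets from the MATLAB code. */
static void kron_sums(const double *ix2, int limx, int limy,
                      int centx, int centy, double *m1, double *m2)
{
    double s1 = 0.0, s2 = 0.0;
    for (int y = 0; y < limy; ++y) {
        double wy = (double)(y + 1 - centy);     /* (1-centy : limy-centy) */
        for (int x = 0; x < limx; ++x) {
            double wx = (double)(x + 1 - centx); /* (1-centx : limx-centx) */
            s1 += wx * wx * ix2[y * limx + x];   /* kron(ones', (..).^2).*ix2 */
            s2 += wy * wx * ix2[y * limx + x];   /* kron((..)', (..)).*ix2   */
        }
    }
    *m1 = s1;
    *m2 = s2;
}
```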
PS: I recently read about NVMEX; any help with using that option on the code above would also be appreciated.
Upvotes: 1
Views: 315
Reputation: 468
This question is too long to answer in a single post, but I'll give you two hints.
If the performance of this code matters enough that you would spend a couple of weeks writing and testing CUDA code, let me tell you about my approach to accelerating MATLAB code:
Hint 1:
Start by rewriting the function in question (still in MATLAB) so that it uses only loops, memory accesses, and basic operations that can be found in the CUDA manual, such as add and multiply. For example, in pseudo-MATLAB code:
function result_array = MyFunctionToParallelise(constants, source_arrays)
for x_idx = xcoords
    for y_idx = ycoords
        local_result = inner_function(x_idx, y_idx, constants, source_arrays(x_idx, y_idx));
        result_array(x_idx, y_idx) = local_result;  % store
    end
end
If you do that and your "inner_function" is parallelisable (i.e. it is independent of the other local_results and can be evaluated in any order of x_idx and y_idx), you are home free!
Write your "inner_function" in C (you do know C and MEX, right?), and make sure it compiles, returns correct results, and works in a MEX file, using a regular loop for the inner y_idx and an OpenMP-parallelised loop for the outer x_idx. If you do that, you will often get a 4x speed-up already (thanks to OpenMP on a 4-core CPU). No toolboxes or other paid add-ons are needed; you get this with MATLAB and MEX by default.
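The loop nest at this stage might look like the sketch below (my own illustration; inner_function and run are hypothetical names, and the mexFunction gateway that unpacks the mxArray inputs is omitted for brevity):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inner_function: produces one output element per (x, y)
   pair. Here it just scales the source pixel; substitute your own. */
static double inner_function(int x, int y, double c,
                             const double *src, int nx)
{
    return c * src[(size_t)y * nx + x];
}

/* Loop nest as it would sit inside the MEX gateway: a regular loop
   over the inner y index, and an OpenMP-parallelised outer x loop.
   The pragma is ignored harmlessly if OpenMP is not enabled. */
static void run(const double *src, double *dst, int nx, int ny, double c)
{
    #pragma omp parallel for
    for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
            dst[(size_t)y * nx + x] = inner_function(x, y, c, src, nx);
}
```

Compiling with `mex CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp" myfun.c` (or the equivalent for your compiler) enables the OpenMP pragma.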
Write a CUDA launcher for "inner_function". No commercial toolboxes are needed. This is the easy part! Simply replace the for loops with threads and blocks, and insert this into your MEX file where the regular function used to be. Expect a 10x-100x speed-up over C at this step.
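Replacing the two loops with threads and blocks could look roughly like this (a sketch under my assumptions, not tested against your data; the device copies and error checks that belong in the MEX file are abbreviated to comments):

```cuda
// Hypothetical CUDA version of the loop nest: one thread per (x, y)
// element. inner_function becomes a __device__ function.
__device__ double inner_function(int x, int y, double c,
                                 const double *src, int nx)
{
    return c * src[y * nx + x];
}

__global__ void run_kernel(const double *src, double *dst,
                           int nx, int ny, double c)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)                 // guard the ragged edge
        dst[y * nx + x] = inner_function(x, y, c, src, nx);
}

// Inside the MEX file, after cudaMemcpy of the inputs to d_src:
//   dim3 block(16, 16);
//   dim3 grid((nx + block.x - 1) / block.x,
//             (ny + block.y - 1) / block.y);
//   run_kernel<<<grid, block>>>(d_src, d_dst, nx, ny, c);
//   ...then cudaMemcpy d_dst back into the output mxArray.
```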
Following this approach, you will be able to debug and verify correctness at every small step. In my experience, typos in the code that manages buffer pointers and buffer sizes are the main source of crashes and wrong results. There is no point in obtaining the WRONG result really fast!
Hint 2: For some complex functions (like kron), if your input and output are of fixed size, it may be possible to obtain register-level-optimised, linear, non-iterative, non-branching code using a computer algebra system such as Wolfram Mathematica. Such code executes super-fast on a GPU. Example: Example use of Mathematica's formula-optimising compiler
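To make the idea concrete, here is a hand-written sketch of what such flattened code looks like for a fixed-size case: kron of two 2x2 matrices fully unrolled into straight-line, branch-free assignments (the function name kron2x2 and the row-major layout are my assumptions):

```c
#include <assert.h>

/* Straight-line, non-iterative, non-branching kron for fixed 2x2
   inputs: every element of the 4x4 row-major result is written
   explicitly, the way a CAS-generated routine would. */
static void kron2x2(const double a[4], const double b[4], double k[16])
{
    k[0]  = a[0]*b[0];  k[1]  = a[0]*b[1];  k[2]  = a[1]*b[0];  k[3]  = a[1]*b[1];
    k[4]  = a[0]*b[2];  k[5]  = a[0]*b[3];  k[6]  = a[1]*b[2];  k[7]  = a[1]*b[3];
    k[8]  = a[2]*b[0];  k[9]  = a[2]*b[1];  k[10] = a[3]*b[0];  k[11] = a[3]*b[1];
    k[12] = a[2]*b[2];  k[13] = a[2]*b[3];  k[14] = a[3]*b[2];  k[15] = a[3]*b[3];
}
```

With no loops or branches, every multiply is independent, so the compiler can keep everything in registers; this is exactly the shape of code a GPU core executes fastest.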
Upvotes: 1