Reputation: 261
We have an assignment where we're given a very inefficient program and we have to optimize the code to make it run faster. I've got most of them going pretty fast, except for these two, which bugs me so much because they are VERY simply functions. One basically sets all values in a two dimensional array to the same value and one basically swaps the values of two 2-D arrays. And that's the problem. They're taking up the most time but because they're so simple, I can't figure out how to reduce them without breaking the functions. And I know it's possible to make them run faster, because other students in the class have been getting ridiculous speed ups. The two in question are:
F2D* fSetArray(int rows, int cols, float val)
{
    int i, j;
    F2D *out;
    out = fMallocHandle(rows, cols);
    for(i=0; i<rows; i++)
        for(j=0; j<cols; j++)
            subsref(out,i,j) = val;
    return out;
}
The only important part is the two loops. Basically, we have a 2-D array with a certain number of rows and a certain number of columns, and we're setting ALL of the values to val. But there's no way that I can see to eliminate one of the loops (which would be the best way to speed it up), unless I'm missing something obvious. If cols and rows were the same number, this would be so much easier.
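To illustrate what "eliminating a loop" would even mean here: if the F2D's elements sit in one contiguous buffer (subsref commonly expands to something like data[i*cols + j], though I can't be sure that holds for this library), the double loop is equivalent to a single pass over rows*cols floats. A sketch on a plain float buffer, under that assumption:

```c
#include <stddef.h>

/* Hypothetical sketch: assumes the 2-D array's elements live in one
 * contiguous row-major buffer. If subsref(out,i,j) really is
 * out->data[i*cols + j], the two loops collapse into this single
 * loop with one index and no per-row bookkeeping. */
static void fill_flat(float *data, int rows, int cols, float val)
{
    size_t n = (size_t)rows * (size_t)cols;
    for (size_t k = 0; k < n; k++)
        data[k] = val;
}
```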
The other function is:
F2D* fDeepCopy(F2D* in)
{
    int i, j;
    F2D* out;
    int rows, cols;
    rows = in->height;
    cols = in->width;
    out = fMallocHandle(rows, cols);
    for(i=0; i<rows; i++)
        for(j=0; j<cols; j++)
            subsref(out,i,j) = subsref(in,i,j);
    return out;
}
The out array is always at least as large as the in array, so we don't have to worry about overflow or anything like that. This function just copies the values of one array into the other, but once again, because it's so simple, I can't quite get it to reduce any further.
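If it helps to see what I mean by "simple": assuming each row is contiguous in memory (i.e. subsref is roughly data[i*cols + j], which I haven't verified for this library), the inner loop is just a block copy, like so:

```c
#include <string.h>

/* Hypothetical sketch: if each row is a contiguous run of cols
 * floats, the inner j loop is equivalent to one memcpy per row
 * (or a single memcpy over the whole rows*cols buffer). */
static void copy_rows(float *dst, const float *src, int rows, int cols)
{
    for (int i = 0; i < rows; i++)
        memcpy(dst + (size_t)i * cols,   /* start of row i in dst */
               src + (size_t)i * cols,   /* start of row i in src */
               (size_t)cols * sizeof(float));
}
```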
And before anyone says parallelization: because of the sample size and the server, the overhead of creating the threads actually slows the program down. I've tried. These two functions are so short that it isn't worth spinning up threads only to kill them after one parallel for.
For reference, yes, technically this is an OpenMP project and we're supposed to use parallelization, but for these two functions doing so actually slows down the whole program. I've timed it with a parallel for loop to check.
EDIT: Thanks for everyone who gave me ideas! I've got it up and running and full speed now!
Upvotes: 3
Views: 510
Reputation: 70909
One type of optimization is loop unrolling. Occasionally the pipeline stalls because there is a lot of activity around obtaining the loop index, updating it, and storing it back in memory. This is probably the primary reason your multithreaded implementation didn't do well: all the threads were likely fighting over access to the index.
If you want to reattempt a multithreaded implementation, have each thread know its "offset" based on the thread count, and have each thread process a different remainder computed via modulus division:
thread 0 works on elements where (i*cols + j) % (thread count) == 0
thread 1 works on elements where (i*cols + j) % (thread count) == 1
(and so on)
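A sketch of that scheme on a flat buffer of n floats (the names here are illustrative, not taken from your code), with a fallback so it still builds when OpenMP isn't enabled:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* fallback stubs so the sketch compiles without OpenMP */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Each thread starts at its own id and strides by the thread count,
 * so no two threads ever share a loop index or write the same slot. */
static void fill_strided(float *data, long n, float val)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        for (long k = tid; k < n; k += nth)
            data[k] = val;
    }
}
```

One caveat on this layout: the strided pattern keeps threads off a shared index, but it interleaves their writes within the same cache lines; giving each thread a contiguous chunk of the range is usually friendlier to the cache.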
If you do not want to reattempt a multithreaded implementation, there are still techniques that can improve performance. Sometimes even a single thread can unnecessarily stall on an index variable (because it causes inefficient use of the processor pipeline). To attempt a fix for this kind of problem, increment the index by a fixed stride and perform that many assignments within each loop iteration.
F2D* fDeepCopy(F2D* in)
{
    int i;
    F2D* out;
    int rows, cols;
    rows = in->height;
    cols = in->width;
    out = fMallocHandle(rows, cols);
    for(i=0; i<rows; i++) {
        // unroll by 4, stopping at the last multiple of 4 so we
        // don't walk off the end of the row
        int j = 0;
        int pads = (cols / 4) * 4;
        for(; j < pads; j = j + 4) {
            subsref(out,i,j)   = subsref(in,i,j);
            subsref(out,i,j+1) = subsref(in,i,j+1);
            subsref(out,i,j+2) = subsref(in,i,j+2);
            subsref(out,i,j+3) = subsref(in,i,j+3);
        }
        // process the remainder when cols is not a multiple of 4
        for(; j < cols; j++) {
            subsref(out,i,j) = subsref(in,i,j);
        }
    }
    return out;
}
CPU designers (and compilers) keep getting better at this, so it is critical that you actually profile your runs to determine whether this optimization is already being done for you; otherwise such a technique might actually slow your code down.
Finally, you can also leverage the SSE extensions, which under the right conditions can perform the same operation on multiple values stored in a single XMM register. For example, you can use inline assembly to pack two sets of four 32-bit single-precision floating point numbers into XMM registers, and then perform the addition in bulk, four floats at a time.
So, something that looks like
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
which would require eight memory lookups, four additions, and four stores (16 operations unoptimized) might be replaced with
movaps xmm0, [v1] ;xmm0 = v1.w | v1.z | v1.y | v1.x
addps xmm0, [v2] ;xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0
which only requires three operations, unoptimized.
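If inline assembly is unappealing, compiler intrinsics express the same movaps/addps idea in plain C. A sketch, assuming an x86 target with SSE and n a multiple of 4 (the function name is mine):

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86/x86-64 only */

/* Adds two float arrays four lanes at a time, mirroring the
 * movaps/addps sequence above. The _mm_loadu_ps/_mm_storeu_ps
 * variants used here tolerate unaligned pointers; n is assumed
 * to be a multiple of 4 in this sketch. */
static void add4(float *res, const float *v1, const float *v2, int n)
{
    for (int k = 0; k < n; k += 4) {
        __m128 a = _mm_loadu_ps(v1 + k);          /* load 4 floats  */
        __m128 b = _mm_loadu_ps(v2 + k);
        _mm_storeu_ps(res + k, _mm_add_ps(a, b)); /* 4 sums at once */
    }
}
```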
Again the key is profiling, so you can detect bottlenecks and then adjust your code to bring them within minimal acceptable performance.
Upvotes: 1