Optimization Loop unrolling to find the inverse of a matrix by the gauss jordan method

I am trying to apply the loop unrolling to find the inverse of a matrix by the Gauss Jorda method, to reduce the number of memory accesses (bottleneck) when the size of the matrices is very large and does not fit in the caches. I get it to go faster, but the result I get is wrong and I don't know why.

for(k=0;k<size;k++)                                  
{                                                       
    
    pivot=original[k][k];
    for(j=0;j<size;j++)                             
    {
        original[k][j]/=pivot;                                  
        inverse[k][j]/=pivot;                                   
    }   
            
    for(i=0;i<size;i++)                                 
    {
           
        if(i!=k)
        {
            pivot = original[i][k];                                 
            for(j=0;j<size;j++)                         
            {                                                                                   
                original[i][j] -= original[k][j]*pivot;                     
                inverse[i][j] -= inverse[k][j]*pivot;   
                                            
            }
        }
            
    }     
}

I hope that the execution of the problem is faster by receiving the number of memory accesses.

my loop unrrolling version is the following:

for(k=0;k<sizeOfMatrix;k += 2)                                   
{                                                           
    pivot=original[k][k];
    for(j=0;j<sizeOfMatrix;j++)                             
    {
        original[k][j]/=pivot;                                  
        inverse[k][j]/=pivot;                                   
    }   
           
    for(i=0;i<sizeOfMatrix;i++)                                 
    {
       
        if(i!=k && i!=k+1)
        {
            pivot = original[i][k];                                 
            for(j=0;j<sizeOfMatrix;j++)                         
            {                                                                                   
                original[i][j] -= original[k][j]*pivot;                     
                inverse[i][j] -= inverse[k][j]*pivot;                              
            }
        }
               
        if(i!=k+1)
        {
            pivot=original[k][k];
            for(j=0;j<sizeOfMatrix;j++)                             
            {
                original[k+1][j]/=pivot;                                    
                inverse[k+1][j]/=pivot;                                 
            }
                              
            pivot = original[i][k+1];                                   
            for(j=0;j<sizeOfMatrix;j++)                         
            {                                                                                   
                original[i][j] -= original[k+1][j]*pivot;                       
                inverse[i][j] -= inverse[k+1][j]*pivot;                              
            }
        }        

    }     
        
}

Upvotes: 0

Answers (2)

subject25

Reputation: 1

You already found the solution to the problem. As the objective is academic. The goal is to reduce the execution time, reduce the number of memory accesses, in my case reduce it by half.

I modify the row k, I divide the elements of the row by its pivot
Modify the row k+1 of the matrices, with the new row k
Modified row k+1, divides the elements of the row by its pivot.
I make the modifications of the other rows with the values k and k + 1

    for (k = 0; k < size; k += 2)
    {
        pivot = original[k][k];
        for (j = 0; j < size; j++)
        {
            original[k][j] /= pivot;
            inverse[k][j] /= pivot;

        }
        pivot = original[k + 1][k];
        for (i = 0; i < size; i++)
        {
            original[k + 1][i] -= original[k][i] * pivot;
            inverse[k + 1][i] -= inverse[k][i] * pivot;
        }
        pivot = original[k+1][k+1];

        for (j = 0; j < size; j++)
        {
            original[k+1][j] /= pivot;
            inverse[k+1][j] /= pivot;
        }
        for (i = 0; i < size; i++)
        {
            if (i != k && i != k + 1)
            {
                pivot = original[i][k];
                
                    for (j = 0; j < size; j++)
                    {
                        original[i][j] -= original[k][j] * pivot;
                        inverse[i][j] -= inverse[k][j] * pivot;
                    }
            }

            if (i != k + 1)
            {
                pivot = original[i][k+1];
                for (j = 0; j < size; j++)
                {
                    original[i][j] -= original[k + 1][j] * pivot;
                    inverse[i][j] -= inverse[k + 1][j] * pivot;

                }
            }
        }
    }

Upvotes: 0

John Bollinger

Reputation: 181179

For the moment, let's leave aside the questions of the numerical fitness of your original code and of whether the unrolling is an exercise you should be attempting at all. You show an attempt to unroll the outer loop by rewriting the body of the middle loop. Although a computation equivalent to the original could be structured along the lines similar to those of the transformed result, it would not constitute an unrolling because loop unrolling does not change order of the operations performed in the loop body over the range of loop iterations. The transformation you have made does produce such changes.

But in your case, the transformed code is simply wrong, at least because

when sizeOfMatrix is odd, it performs out-of-bounds array accesses during the k-loop iteration where k == sizeOfMatrix - 1.
it uses the wrong pivot value at the beginning of the if(i!=k+1) block.
the correct value would, in principle, be stored at original[k+1][k+1], but the needed value has not yet been computed on iterations where i <= k.
the if(i!=k+1) block scales row k+1 of the original and inverse matrices for every i other than k+1, but those should be scaled only once each.

Without further consideration of whether a transformation along the lines you've attempted might lend itself to a performance improvement, I observe that if you wanted to perform a genuine unrolling then that's the wrong loop to be trying to unroll. If there were any advantage to be gained by a true unrolling then it would come from unrolling the innermost loop. Like this, for example:

for (k = 0; k < size; k++) {                                                       
    // ...
    for (i = 0; i < size; i++) {
        // ...
            for (j = 0; j < size ^ 1; j += 2) {
                // two rows per iteration                                                                                  
                original[i][j] -= original[k][j] * pivot;
                original[i][j + 1] -= original[k][j + 1] * pivot;
                inverse[i][j] -= inverse[k][j] * pivot;
                inverse[i][j + 1] -= inverse[k][j + 1] * pivot;
            }
            if (size & 1) {
                // a single additional row when needed
                assert(j == size - 1);
                original[i][j] -= original[k][j] * pivot;
                inverse[i][j] -= inverse[k][j] * pivot;
            }
        // ...
    }
}

But as I remarked in comments, if your compiler expected such an unrolling to speed up the program then it would do it automatically when you compile with optimization enabled. And the compiler's judgement about such things is pretty good.

In fact, hand-optimizations can easily worsen performance instead of improving it, either inherently or because they interfere with the compiler performing more effective optimizations itself.

If best speed is your objective then you should probably use a well-tuned third-party implementation, such as from lapack or atlas.

If you insist on rolling your own, then you likely could get more speedup simply by improving the algorithm (even sticking with G-J), but you should probably give some of that back to obtain improved numeric stability. Specifically,

at each outer-loop iteration, choose the available row with the largest-magnitude value in the current elimination column, and swap it, if necessary, with the current row. This will minimize the incidence of dividing large numbers by very small numbers, especially zero.
(after (1), above,) in the original matrix, compute updates only for the elements in columns to the right of the current pivot. The other elements of the original have already had all the influence on the result that they ever will have. (But you cannot shortcut this way with the inverse matrix.)

If you do not perform some variation on (1) then your inversion will fail on some matrices that are, in fact, invertible.

Something along these lines, for example:

// Temporary space for row swaps
element_type *tmp = malloc(size * sizeof(*tmp));
if (!tmp) {
    abort();  // you probably want something more graceful here
}

for (int pivot_col = 0; pivot_col < size; pivot_col++) {

    // Choose the remaining row that maximizes the magnitude of the pivot element

    int pivot_row = pivot_col;
    element_type pivot = fabs(original[pivot_row][pivot_col]);  // or fabsf() or fabsl() ...
                                                     
    for (row = pivot_row + 1; row < size; row++) {
        element_type test_el = fabs(original[row][pivot_col]);

        if (test_el > pivot) {
            pivot_row = row;
            pivot = test_el;
        }
    }

    if (pivot_row != pivot_col) {

        // This assumes array-of-array structure.
        // There are better options for array-of-pointers structure.

        int current_row = pivot_col;

        // Only the tails of the original rows, starting at pivot_col, need to be swapped
        memcpy(tmp, &original[current_row][pivot_col], (size - pivot_col) * sizeof(*tmp));
        memcpy(&original[current_row][pivot_col], &original[pivot_row][pivot_col], (size - pivot_col) * sizeof(*tmp));
        memcpy(&original[pivot_row][pivot_col], tmp, (size - pivot_col) * sizeof(*tmp));

        // The full inverse rows need to be swapped
        memcpy(tmp, inverse[current_row], sizeof(inverse[current_row]));
        memcpy(inverse[current_row], inverse[pivot_row], sizeof(inverse[current_row]));
        memcpy(inverse[pivot_row], tmp, sizeof(inverse[current_row]));
    }
    pivot_row = pivot_col;
    pivot = original[pivot_row][pivot_col];  // get the correctly-signed pivot value

    // All columns of the inverse row need to be scaled, but
    // only the tail columns of the original row need to be.
    int col = 0;

    for (; col < pivot_col + 1; col++) {
        inverse[pivot_row][col] /= pivot;                                   
    }   
    for (; col < size; col++) {
        original[pivot_row][col] /= pivot;                                  
        inverse[pivot_row][col] /= pivot;                                   
    }   
            
    // Compute the effect on the other rows of canceling the pivot column
    for (int row = 0; row < size; row++) {
        if (row == pivot_row) continue;

        element_type cancel_val = original[row][pivot_col];

        // where col < pivot_col, original[pivot_row][col] is (logically) zero
        // where col == pivot_col, we are (logically) setting original[row][col] to zero
        for (col = 0; col <= pivot_col; col++) {
            inverse[row][col] -= inverse[pivot_row][col] * cancel_val;   
        }
        // we need actually to write updates to the remaining columns of the original
        for(; col < size; col++) {
            original[row][col] -= original[pivot_row][col] * cancel_val;
            inverse[row][col] -= inverse[pivot_row][col] * cancel_val;
        }
    }     
}

free(tmp);

That's inspired by your first code, with improved names and an implementation of algorithmic improvements (1) and (2).

Upvotes: 0

Optimization Loop unrolling to find the inverse of a matrix by the gauss jordan method

Answers (2)

Related Questions