Marco Masci

Reputation: 818

How to parallelize this recursive algorithm

Given a float matrix A[N,M] and a float array T[M], I would like to parallelize the following code using OpenMP:

for ( int i = 0; i < N; i++ )
{
    T[0] = A[i][0];
    for ( int j = 1; j < M; j++ )
        T[j] = f(A[i][j]-T[j-1]) + A[i][j];

    float v = T[M-1];
    for ( int j = M-2; j >=0; j-- )
    {
        v = g(A[i][j]-v);
        A[i][j] = T[j] + v;
    }

}

where f and g are nonlinear functions. This is a recursive filter. My first attempt (I am new to OpenMP) was to parallelize the outer loop and allocate T as a matrix of dimension #threads x M, where #threads=omp_get_max_threads():

#pragma omp parallel for
for ( int i = 0; i < N; i++ )
{
    float *T1 = T[omp_get_thread_num()]; // each thread works on its own row of T

    T1[0] = A[i][0];
    for ( int j = 1; j < M; j++ )
        T1[j] = f(A[i][j]-T1[j-1]) + A[i][j];

    float v = T1[M-1];
    for ( int j = M-2; j >=0; j-- )
    {
        v = g(A[i][j]-v);
        A[i][j] = T1[j] + v;
    }

}

In this way, every thread uses its own row of T. I tested it on my 8-core i7 CPU and the speedup is about 5x, but on my 4-core smartphone CPU I get a really small speedup (about 1.3x-1.5x). My questions are:

  1. I expected the speedup on Android to be around 2x, but instead it is in the 1.3x-1.5x range. Is this a reasonable speedup, or should it be higher?
  2. Is there a better way to parallelize this sort of recursive matrix filter?

Update 1.

To test it, I used MS Visual Studio with OpenMP enabled on the 8-core i7 machine, while on my smartphone I use the toolchain provided by the Android NDK, with the -fopenmp -O3 -funroll-loops flags. I use QueryPerformanceCounter on Windows, and the following code on Android:

double  CUtil::GetTimeMs()
{
    struct timespec res;
    clock_gettime(CLOCK_REALTIME, &res);
    return 1000.0 * res.tv_sec + (double) res.tv_nsec / 1e6;
}
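As an aside: CLOCK_REALTIME can jump forwards or backwards when the system clock is adjusted (e.g. by NTP), so for interval measurements a monotonic clock is generally safer. A hypothetical variant of the function above (not what I used for the numbers reported here):

```cpp
#include <ctime>

// HYPOTHETICAL variant of GetTimeMs() using CLOCK_MONOTONIC, which
// (unlike CLOCK_REALTIME) cannot jump when the system clock is
// adjusted; available on Linux/Android.
double GetTimeMsMonotonic()
{
    struct timespec res;
    clock_gettime(CLOCK_MONOTONIC, &res);
    return 1000.0 * res.tv_sec + (double) res.tv_nsec / 1e6;
}
```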

Update 2. I have updated the speed-up numbers on Android.

Upvotes: 0

Views: 721

Answers (1)

Gilles

Reputation: 9509

My feeling is that you had the right approach for the parallelisation.

However, since your T array seems to be only used as temporary storage, instead of allocating it as a 2D array with first dimension being the number of threads to use, I would just allocate it inside the parallel region as private. This shouldn't make much of a difference, but might simplify the job of ensuring data locality in a NUMA environment, and avoid any potential false sharing.

The code would look like this:

#pragma omp parallel
{
    float *T = new float[M];
    #pragma omp for schedule( static )
    for ( int i = 0; i < N; i++ ) {
        float *AA = A[i]; // this "optimisation" is arguable and can be ignored
        T[0] = AA[0];
        for ( int j = 1; j < M; j++ ) {
            T[j] = f( AA[j] - T[j-1] ) + AA[j];
        }
        float v = T[M-1];
        for ( int j = M-2; j >= 0; j-- ) {
            v = g( AA[j] - v );
            AA[j] = T[j] + v;
        }
    }
    delete[] T;
}

Now, whether it will make any difference performance-wise on your code, well, I'm sceptical. However, I draw your attention to the use of clock() as a timer for OpenMP code (and multi-threaded code in general). The point is that (on POSIX systems at least, if I'm not mistaken) clock() returns the CPU time of the current thread and its children, while on Windows the same function returns the CPU time of the calling thread only.
So if by any chance your PC is on Windows and your mobile phone is on Android, on the former platform you would print the CPU time for one thread only, while on the latter you would print the cumulated CPU time of all threads...

This is a very speculative idea, but in any case I can't encourage you enough to drop clock() and use omp_get_wtime(), which returns elapsed wall-clock time, which is what you really want to measure.


EDIT:

Well, reading your updates (which appeared between my starting to write this answer and its publication), it looks like I was somewhat right about the platforms (Windows and Android). However, I was wrong about clock(); still, using a common timer like omp_get_wtime() on both platforms would be a good idea.

Nevertheless, I don't believe any of what I proposed so far can explain the discrepancy between the speed-ups seen on both machines. I suspect that the bottom line might just be the hardware characteristics. Again, this is very speculative (especially considering I never tried to run anything on a smartphone), but this could just be the consequence of (very) different balances between memory bandwidth, cache sizes and CPU performances on both machines:

  • The speed-up you get on PC (~5x for 8 cores) is good but not perfect. This means that you're already reaching a bottleneck there, which is likely either the memory bandwidth, or the cache size. This means that your code is likely to be sensitive to these hardware parameters.
  • The speed-up measured on the smartphone shows very little improvement over the sequential code. I would (naïvely maybe) expect the cache sizes and memory bandwidth of a phone's CPU to be far smaller than on a PC. Even just looking at the ratio between CPU peak performance and cache size and/or memory bandwidth, I would expect a high-end PC CPU to be much better balanced than a smartphone CPU. And if this is the case, considering the high-end PC CPU is already pushed to its scalability limits by your code, it's quite normal that the phone's CPU reaches its limits even earlier.

A good way for you to evaluate whether this is indeed the case would be to build a roofline model of both platforms, compute the arithmetic intensity of your algorithm, and plot it on the graphs. This will give you a fair idea of how well you are performing and whether you have any headroom for further improvement.

Upvotes: 1
