Reputation: 818
Given a float matrix A[N,M] and a float array T[M], I would like to parallelize the following code using OpenMP:
for ( int i = 0; i < N; i++ )
{
    T[0] = A[i][0];
    for ( int j = 1; j < M; j++ )
        T[j] = f(A[i][j]-T[j-1]) + A[i][j];
    float v = T[M-1];
    for ( int j = M-2; j >= 0; j-- )
    {
        v = g(A[i][j]-v);
        A[i][j] = T[j] + v;
    }
}
where f and g are non-linear functions. This is a recursive filter. My first attempt (I am new to OpenMP) was to parallelize the outer loop and allocate T as a matrix of dimensions #threads × M, where #threads = omp_get_max_threads():
#pragma omp parallel for
for ( int i = 0; i < N; i++ )
{
    // T1 points to the row of T indexed by the current thread
    float *T1 = T[omp_get_thread_num()];
    T1[0] = A[i][0];
    for ( int j = 1; j < M; j++ )
        T1[j] = f(A[i][j]-T1[j-1]) + A[i][j];
    float v = T1[M-1];
    for ( int j = M-2; j >= 0; j-- )
    {
        v = g(A[i][j]-v);
        A[i][j] = T1[j] + v;
    }
}
In this way, every thread uses its own T memory. I tested it on my 8-core i7 CPU and the speedup is about 5x, but on my 4-core smartphone CPU I get a really small speedup (like 1.3x - 1.5x). My questions are:
Update 1.
To test it, I used MS Visual Studio with OpenMP enabled on the 8-core i7 machine, while on my smartphone I use the toolchain provided by the Android NDK with the -fopenmp -O3 -funroll-loops
flags. I use QueryPerformanceCounter
on Windows, and the following code on Android:
#include <time.h>

double CUtil::GetTimeMs()
{
    struct timespec res;
    clock_gettime(CLOCK_REALTIME, &res);
    return 1000.0 * res.tv_sec + (double) res.tv_nsec / 1e6;
}
Update 2. I have updated the speed up numbers on Android.
Upvotes: 0
Views: 721
Reputation: 9509
My feeling is that you had the right approach for the parallelisation.
However, since your T
array seems to be used only as temporary storage, instead of allocating it as a 2D array whose first dimension is the number of threads, I would just allocate it inside the parallel
region as private to each thread. This shouldn't make much of a difference, but it might simplify the job of ensuring data locality in a NUMA environment, and avoid any potential false sharing.
The code would look like this:
#pragma omp parallel
{
    float *T = new float[M];
    #pragma omp for schedule( static )
    for ( int i = 0; i < N; i++ ) {
        float *AA = A[i]; // this "optimisation" is arguable and can be ignored
        T[0] = AA[0];
        for ( int j = 1; j < M; j++ ) {
            T[j] = f( AA[j] - T[j-1] ) + AA[j];
        }
        float v = T[M-1];
        for ( int j = M-2; j >= 0; j-- ) {
            v = g( AA[j] - v );
            AA[j] = T[j] + v;
        }
    }
    delete[] T;
}
Now, whether it will make any difference performance-wise on your code, well, I'm sceptical. However, I draw your attention to the use of clock()
as a timer for OpenMP code (and multi-threaded code in general). The point is that (on POSIX systems at least, IINM) clock()
returns the CPU time of the current thread and its children. Again IINM, on Windows the same function returns the CPU time of the calling thread only.
So if by any chance your PC runs Windows and your mobile phone runs Android, on the former platform you would print the CPU time for one thread only, while on the latter you would print the cumulated CPU time of all threads...
This is a very speculative idea, but in any case I cannot encourage you enough to just drop clock()
and to use omp_get_wtime()
instead, which returns the elapsed wall-clock time, which is what you really want to measure.
EDIT:
Well, reading your updates (which appeared between the start of my writing this answer and its publication), it looks like I was somewhat right (about the Windows and the Android). However, I was just wrong about clock()
; still, using a common timer like omp_get_wtime()
on both platforms would be a good idea.
Nevertheless, I don't believe anything I have proposed so far can explain the discrepancy between the speed-ups seen on the two machines. I suspect the bottom line might just be the hardware characteristics. Again, this is very speculative (especially considering I've never tried to run anything on a smartphone), but it could just be the consequence of (very) different balances between memory bandwidth, cache sizes and CPU performance on the two machines:
A good way for you to evaluate whether this is indeed the case would be to use a roofline model of both platforms and to compute the arithmetic intensity of your algorithm in order to plot it on the graphs. This will give you a fair idea of how well you perform and whether you have any headroom for further improvement.
Upvotes: 1