Hopefully Quick Parallel Optimal Lapack Routine (gfortran) Questions

Question

I thought I had a very clear understanding of this until two days ago, but now I might be overthinking it and confusing myself. I'll explain what I'm doing and then ask a couple of probably simplistic questions; I've searched and found conflicting answers so far. Surely someone can set me straight.

I have written a Fortran code that uses a LAPACK routine to solve an eigenvalue problem. My problem setup is (A-LB)x=0, where L is my eigenvalue, x is my eigenvector(s), and A and B are square, complex, non-symmetric, non-Hermitian, non-triangular matrices. A and B are both NxN, and N in my code will typically be between 1000 and 3000.

Right now the code works perfectly. I'm using an optimized ATLAS install with LAPACK. I'm specifically running routine ZGGEV because, for now, I need ALL eigenvalue solutions and ALL associated eigenvector solutions.
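For readers who want something concrete to experiment with, here is a minimal sketch of the kind of ZGGEV call described above. It is not the original code: the program name, the tiny matrix order n = 4, and the random test matrices are placeholders chosen only for illustration. It requests right eigenvectors only, uses the standard LWORK = -1 workspace query, and recovers each eigenvalue L as ALPHA(i)/BETA(i).

```fortran
! Hypothetical test program; link against LAPACK/BLAS, for example:
!   gfortran zggev_sketch.f90 -llapack -lblas
program zggev_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4            ! placeholder; the question uses 1000-3000
  complex(dp) :: a(n,n), b(n,n), alpha(n), beta(n)
  complex(dp) :: vl(1,1), vr(n,n), wquery(1)
  complex(dp), allocatable :: work(:)
  real(dp) :: rwork(8*n), re(n,n), im(n,n)
  integer :: lwork, info, i
  external :: zggev

  ! Random complex test matrices stand in for the real A and B.
  call random_number(re); call random_number(im)
  a = cmplx(re, im, kind=dp)
  call random_number(re); call random_number(im)
  b = cmplx(re, im, kind=dp)

  ! Workspace query: LWORK = -1 returns the optimal size in wquery(1).
  call zggev('N', 'V', n, a, n, b, n, alpha, beta, vl, 1, vr, n, &
             wquery, -1, rwork, info)
  lwork = int(real(wquery(1), dp))
  allocate(work(lwork))

  ! Actual solve. A and B are overwritten on exit.
  call zggev('N', 'V', n, a, n, b, n, alpha, beta, vl, 1, vr, n, &
             work, lwork, rwork, info)
  if (info /= 0) then
    print *, 'ZGGEV failed, INFO = ', info
    stop 1
  end if

  ! Eigenvalue i is alpha(i)/beta(i); column i of vr is its right eigenvector.
  do i = 1, n
    if (abs(beta(i)) > tiny(1.0_dp)) then
      print '(a,i0,a,2es14.5)', 'lambda(', i, ') = ', alpha(i) / beta(i)
    else
      print '(a,i0,a)', 'lambda(', i, ') is infinite (beta = 0)'
    end if
  end do
end program zggev_sketch
```

Passing 'V' for JOBVL as well would also return left eigenvectors, in which case vl must be dimensioned (n,n) and LDVL set to n.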

Now I'm trying to optimize my code to run faster. All of the computers in our lab contain 4- or 8-core CPUs and run Ubuntu. Is there anything I can do to utilize my full CPU when solving this problem? I've been looking into the following things:

I installed an optimized OpenBLAS library and it is definitely faster, but I notice it still uses only 1 core. (There's a small spike where it uses more. I assume this spike is the BLAS package running in parallel while LAPACK is limited to one core?)
I've investigated PLASMA, but it doesn't look like it will solve my equation in its current form.
I've looked into ScaLAPACK, but this is over my head at the moment and I'm not sure it's worth learning to use on an 8-core CPU. Furthermore, I use OpenMP threading for a later section of my code and I've never combined OpenMP with MPI.

Finally, I have a few specific BLAS questions:

ATLAS comes with "libptcblas" and "libptf77blas" libraries. These are supposed to be threaded libraries, but I'm not noticing a difference when I use them; in fact, the code runs a little slower (I guess due to overhead). Is there a call I need to make to utilize these? Is there a reason for me to use these libraries over "libcblas" and "libf77blas"?
With OpenBLAS, the build also produced a very specific "libopenblas_penrynp-r0.2.12". Is this the threaded version? Again, I don't notice any difference between running this BLAS and running "libopenblas".

Hopefully someone can clear up some of my BLAS questions and point me toward a faster solution method. Thanks!

ztik · Accepted Answer

You are correct to expect multi-threaded behavior mainly from the BLAS and not from the LAPACK routines. The matrices are big enough to make use of a multi-threaded environment. I am not sure about the extent of BLAS usage in the ZGGEV routine, but it should be more than a spike.

Regarding your specific questions.

Even though I have not used the ATLAS library extensively, it is known that "the number of threads to use is determined at compile time". Please refer to http://math-atlas.sourceforge.net/faq.html#tnum .
The specific libopenblas_*.a is a copy of, or a soft link to, libopenblas.a. Again, the thread number is defined at compile time.

Please check the log files and std.out from the library builds and verify that they have identified the correct number of CPUs.

I noticed that you mentioned more than one machine. Note that ATLAS is an automatically tuned library, so you have to recompile it on each machine. OpenBLAS, on the other hand, accepts the DYNAMIC_ARCH=1 option in make. A library built this way selects the optimized routines for each machine at run time.

My suggestion for your multi-threaded test is to build Openblas using

$ make DYNAMIC_ARCH=1 NUM_THREADS=8

Then CALL ZGEMM in your program. This routine is definitely optimized and should show multi-threaded behavior.
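A minimal sketch of such a ZGEMM test is given below. The program name, the matrix order n = 2000, and the random input data are assumptions made only for illustration. The timing uses wall-clock time from system_clock, because CPU time is typically summed over all threads and can hide a parallel speedup. The run commands in the header comment assume an OpenBLAS build, which reads the OPENBLAS_NUM_THREADS environment variable at run time.

```fortran
! Hypothetical test program; build and run, for example:
!   gfortran zgemm_thread_test.f90 -lopenblas -o zgemm_thread_test
!   OPENBLAS_NUM_THREADS=1 ./zgemm_thread_test
!   OPENBLAS_NUM_THREADS=8 ./zgemm_thread_test
program zgemm_thread_test
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 2000         ! assumed size, within the 1000-3000 range
  complex(dp), parameter :: one  = (1.0_dp, 0.0_dp)
  complex(dp), parameter :: zero = (0.0_dp, 0.0_dp)
  complex(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  real(dp), allocatable :: re(:,:), im(:,:)
  integer(i8) :: t0, t1, rate
  external :: zgemm

  allocate(a(n,n), b(n,n), c(n,n), re(n,n), im(n,n))

  ! Fill A and B with random complex entries.
  call random_number(re); call random_number(im)
  a = cmplx(re, im, kind=dp)
  call random_number(re); call random_number(im)
  b = cmplx(re, im, kind=dp)

  ! Time C = A*B by wall clock.
  call system_clock(t0, rate)
  call zgemm('N', 'N', n, n, n, one, a, n, b, n, zero, c, n)
  call system_clock(t1)

  print '(a,i0,a,f8.3,a)', 'ZGEMM, N = ', n, ': ', &
        real(t1 - t0, dp) / real(rate, dp), ' s wall clock'
  ! Use the result so the multiplication cannot be optimized away.
  print *, 'checksum: ', sum(abs(c))
end program zgemm_thread_test
```

If the wall-clock time drops noticeably between the two runs and a process monitor shows several cores busy, the threaded BLAS is being picked up. If the times are the same, check which library the executable is actually linked against.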
