Accelerate framework uses only one core on Mac M1

Question

The following C program (dgesv_ex.c)

#include 
#include 

/* DGESV prototype */
extern void dgesv( int* n, int* nrhs, double* a, int* lda, int* ipiv,
                double* b, int* ldb, int* info );

/* Main program */
int main() {
        /* Locals */
        int n = 10000, info;
        /* Local arrays */
        /* Initialization */
        double *a = malloc(n*n*sizeof(double));
        double *b = malloc(n*n*sizeof(double));
        int *ipiv = malloc(n*sizeof(int));
        for (int i = 0; i < n*n; i++ )
        {
                a[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
        }
        for(int i=0;i


compiled on a Mac mini M1 with the command
clang -o dgesv_ex dgesv_ex.c -framework accelerate

uses only one core of the processor (as also shown by the activity monitor)
me@macmini-M1 ~ % time ./dgesv_ex 
./dgesv_ex  35,54s user 0,27s system 100% cpu 35,758 total

I checked that the binary is of the right type:
me@macmini-M1 ~  % lipo -info dgesv
Non-fat file: dgesv is architecture: arm64

As a comparaison, on my Intel MacBook Pro I get the following output :
me@macbook-intel ˜ % time ./dgesv_ex
./dgesv_ex  142.69s user 0,51s system 718% cpu 19.925 total

Is it a known problem ? Maybe a compilation flag or else ?

Maynard Handley · Accepted Answer

The original poster and the commenter are both somewhat unclear on exactly how AMX operates. That's OK, it's not obvious! For pre-A15 designs the setup is:

(a) Each cluster (P or E) has ONE AMX unit. You can think of it as being more an attachment of the L2 than of a particular core. (b) This unit has four sets of registers, one for each core. (c) An AMX unit gets its instructions from the CPU (sent down the Load/Store pipeline, but converted at some point to a transaction that is sent to the L2 and so the AMX unit).

Consequences of this include that

AMX instructions execute out of order on the core just like other instructions, interleaved with other instructions, and the CPU will do all the other sort of overhead you might expect (counter increments, maybe walking and derefencing sparse vectors/matrices) in parallel with AMX. A core that is running a stream of AMX instructions will look like a 100% utilized core. Because it is! (100% doesn't mean every cycle the CPU is executing at full width; it means the CPU never gives up any time to the OS for whatever reason).
ideally data for AMX is present in L2. If present in L1, you lose a cycle or three in the transfer to L2 before AMX can access it.
(most important for this question) there is no value in having multiple cores running AMX code to solve a single problem. They will all land up fighting over the same single AMX unit anyway! So why complicate the code with Altivec by trying to achieve that. It will work (because of the abstraction of 4 sets of registers) but that's there to help "un-co-ordinated" code from different apps to work without forcing some sort of synchronization/allocation of the resource.
the AMX unit on the E-cluster does work, so why not use it? Well, it runs at a lower frequency and a different design with much less parallelization. So it can be used by code that, for whatever reason, both runs on the E-core and wants AMX. But trying to use that AMX unit along with the P AMX-unit is probably more trouble than it's worth. The speed differences are large enough to make it very difficult to ensure synchronization and appropriate balancing between the much faster P and the much slower E. I can't blame Apple for considering pursuing this a waste of time.

More details can be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f

It is certainly possible that Apple could change various aspects of this at any time, for example adding two AMX units to the P-cluster. Presumably when this happens, Accelerate will be updated appropriately.

Accelerate framework uses only one core on Mac M1

Answers (2)

Related Questions