A23149577

Reputation: 2145

AArch64 - Running ARM and ASIMD instructions in parallel

I want to write assembly code that uses both ARM (general-purpose) instructions and ASIMD instructions in parallel. My first question is whether this can be done on ARMv8. Based on this thread, it is possible on ARMv7; however, data transfer between NEON and ARM registers takes a considerable amount of time. Second, I am looking for a way to make the two instruction streams run in parallel. Here is what I am trying to do:

.
.
.
<ASIMD instruction>
<ASIMD instruction>
<ASIMD instruction>
<Data MOV between ASIMD vectors and ARM Reg>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<Data MOV between ARM Reg and ASIMD vectors>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
.
.
.

I am wondering if I can do this using two threads. I am working on an ARM Cortex-A53 processor. I also have access to an ARM Cortex-A57, but I think these platforms are roughly the same and have equal capabilities.

Upvotes: 2

Views: 3765

Answers (2)

Dric512

Reputation: 3729

I am not sure what you mean by "in parallel". Neither Cortex-A53 nor Cortex-A57 supports hardware multithreading within a core (although it is possible to have several CPUs on the same chip, which is a different matter).

What you can do, however, on Cortex-A57 (certainly less so on A53, which executes in order) is exploit the fact that execution is mostly out-of-order. So if you don't have dependencies between the instructions, a long-latency instruction can execute, and during that time shorter instructions can execute too. But deliberately exploiting this is very difficult, and the best approach may be to trust that the CPU will do as much out-of-order execution as it can.

Upvotes: 2

James Greenhalgh

Reputation: 2491

I think your comments on threading are misplaced here, or perhaps you have a background in a hyper-threaded (or other simultaneous multithreading) architecture. Neither Cortex-A57 nor Cortex-A53 is an SMT microarchitecture, so at any time you will only have one thread executing on one core. This means your idea of having one thread for Advanced SIMD instructions and one thread for integer/A32/T32 instructions (what you call "ARM instructions") is not going to result in good overall utilisation of a multi-core system.

The thread you linked to discusses a model for the Cortex-A8 microarchitecture in which data dependencies carried through Neon instructions back to A32 instructions cause pipeline bubbles (note that the other comment saying this has to do with memories being synced is incorrect). While it is the case that there is some cost to moving data from Advanced SIMD registers to core registers, the cost is much lower than that thread suggests (see, for example, the Cortex-A57 Software Optimisation Guide, which gives latency numbers for each instruction).

The performance benefits you gain from making use of the vectorised Advanced SIMD instructions will depend on the blend of instructions you intend to use in the A32 and Advanced SIMD portions of your algorithm. Moving the data around too often will have the obvious impact on your execution speed - the more time you spend moving data, the less time you are spending doing the work you intend to do!
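To illustrate the point about moving data, here is a minimal sketch in C using Advanced SIMD intrinsics. It is not from the original post: the function name `count_above` and the counting task are hypothetical, chosen only to show a loop that keeps its intermediate state in vector registers and crosses between core and SIMD registers once on the way in and once on the way out. On non-AArch64 targets the sketch falls back to plain scalar code.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: count the elements of `a` that are greater than
 * `threshold`. The per-lane counters stay in a vector register for the
 * whole loop; only the final reduction moves data to a core register.
 * Assumes n is small enough that no 32-bit lane counter overflows. */
size_t count_above(const uint32_t *a, size_t n, uint32_t threshold)
{
    size_t count = 0;
    size_t i = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
    uint32x4_t thr = vdupq_n_u32(threshold);   /* core -> SIMD, once */
    uint32x4_t acc = vdupq_n_u32(0);

    for (; i + 4 <= n; i += 4) {
        /* Lanes where a[i+k] > threshold become 0xFFFFFFFF, others 0. */
        uint32x4_t mask = vcgtq_u32(vld1q_u32(a + i), thr);
        /* Subtracting 0xFFFFFFFF adds 1 modulo 2^32. */
        acc = vsubq_u32(acc, mask);
    }
    count = vaddvq_u32(acc);                   /* SIMD -> core, once */
#endif

    /* Scalar tail (and the whole loop when Advanced SIMD is unavailable). */
    for (; i < n; i++) {
        count += (a[i] > threshold);
    }
    return count;
}
```

Extracting a lane inside the loop (for example with `vgetq_lane_u32`) would give the same result but pay the transfer latency on every iteration, which is the cost the paragraph above warns about.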

The instruction interleaving you propose above is a common way to expose instruction level parallelism, and is likely to work well within a single thread.
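As a rough sketch of that interleaving in a single thread, the C function below runs two independent dependency chains in the same loop: an Advanced SIMD accumulation over one array and an integer rotate-xor checksum over another. The names `sum_and_checksum` and `rotl32` and the choice of workload are assumptions made for illustration, not anything from the question. Whether the compiler keeps the two chains interleaved, and how much the core overlaps them, depends on the toolchain and the microarchitecture, so treat this as a starting point to measure rather than a guaranteed speed-up.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Rotate a 32-bit value left by r bits (0 < r < 32). */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

/* Hypothetical example: sum `v` with Advanced SIMD while computing an
 * unrelated checksum over `s` in general-purpose registers. The two
 * chains share no data, so a superscalar core is free to overlap them. */
void sum_and_checksum(const uint32_t *v, const uint32_t *s, size_t n,
                      uint32_t *sum_out, uint32_t *chk_out)
{
    uint32_t sum = 0;
    uint32_t chk = 0;
    size_t i = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
    uint32x4_t acc = vdupq_n_u32(0);

    for (; i + 4 <= n; i += 4) {
        /* Advanced SIMD chain: four additions per instruction. */
        acc = vaddq_u32(acc, vld1q_u32(v + i));

        /* Integer chain: independent of `acc`, so it can issue alongside. */
        for (size_t k = 0; k < 4; k++) {
            chk = rotl32(chk, 5) ^ s[i + k];
        }
    }
    sum = vaddvq_u32(acc);   /* one SIMD -> core move, after the loop */
#endif

    /* Scalar tail (and the whole loop when Advanced SIMD is unavailable). */
    for (; i < n; i++) {
        sum += v[i];
        chk = rotl32(chk, 5) ^ s[i];
    }

    *sum_out = sum;
    *chk_out = chk;
}
```

In hand-written assembly the same idea applies: alternate instructions from the two chains so that neither has to wait on the other's results.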

Upvotes: 4
