Brent Bradburn

Reputation: 54859

Best-case instruction throughput on ARM NEON

What is the best-case instruction throughput for a compute-bound algorithm coded in ARM-NEON?

For example, if I have a simple algorithm built from a large number of 8-bit -> 8-bit operations, what is the fastest execution speed (measured in 8-bit operations per cycle) that could be sustained, assuming the latency of any memory I/O is fully hidden?

I am initially interested in Cortex-A8, but if you also have data for different processors, please note the differences.

Upvotes: 3

Views: 2821

Answers (2)

Exophase

Reputation: 726

Most integer operations on Cortex-A8's NEON unit execute 128 bits at a time, not 64 bits. You can see the throughput figures in the TRM, found here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/index.html Some notable exceptions include multiplications, shifts by a register value, and bit selects. But if you think about it, without 128-bit integer operations there would be a lot less reason to use these instructions, since Cortex-A8 can already execute two 32-bit scalar integer operations in parallel.
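As a back-of-the-envelope sketch of what that datapath width means for the 8-bit case in the question: the peak rate is the datapath width divided by the element width, times the number of such instructions issued per cycle. The function names and the `issue_per_cycle` parameter are made up for illustration, and one NEON data-processing instruction per cycle is an assumption here, not a figure taken from the TRM.

```c
#include <assert.h>

/* Elements handled by one SIMD instruction:
   datapath width divided by element width. */
static inline unsigned lanes_per_instruction(unsigned datapath_bits,
                                             unsigned element_bits)
{
    assert(element_bits > 0 && datapath_bits % element_bits == 0);
    return datapath_bits / element_bits;
}

/* Peak element operations per cycle, ASSUMING the core sustains
   issue_per_cycle such instructions every cycle with no stalls
   (issue_per_cycle is a hypothetical parameter, not a documented value). */
static inline unsigned peak_ops_per_cycle(unsigned datapath_bits,
                                          unsigned element_bits,
                                          unsigned issue_per_cycle)
{
    return lanes_per_instruction(datapath_bits, element_bits) * issue_per_cycle;
}
```

Under that assumption, `peak_ops_per_cycle(128, 8, 1)` gives 16 for a 128-bit datapath, while a 64-bit datapath would give 8.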

Sadly, Cortex-A8 and A9 were the last ARM cores to include public documentation of execution performance. I haven't done extensive testing, but I think A15 can execute a 128-bit and a 64-bit NEON operation in parallel (not sure what restrictions there are). And from what I've heard in passing (this is totally untested), both Cortex-A5 and A7 have 64-bit NEON execution. A5 is further limited by having only 32-bit NEON load/store throughput, while A8 actually has 128-bit, and A9 and A7 have 64-bit.

Upvotes: 1

Peter M

Reputation: 1988

As nobar mentioned, this will vary depending on the micro-architecture (Samsung, Apple, Qualcomm, etc.). But basically, in the stock A8 implementation, NEON is a 64-bit architecture: one or two 64-bit operands give a 64-bit result. So without any pipeline (data dependency) stalls or I/O stalls, an integer pipeline can do eight 8-bit operations per cycle in SIMD fashion. The best case on stock ARM processors that are single-issue for ALU/multiply operations is therefore probably "8".

You can look at the ARM architecture reference for an idea of how long various instructions take on stock ARM A8 processors. If you aren't familiar with the nomenclature, "D" registers are 64 bits wide, "Q" registers are double-wide at 128 bits, and instructions can treat the data in the registers as 8-, 16-, or 32-bit elements.
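To make the D/Q nomenclature concrete, here is a minimal sketch of an element-wise 8-bit add. The `add_u8` helper is a made-up name for illustration. When compiled for a NEON target it uses the standard `arm_neon.h` intrinsics: `vaddq_u8` works on a Q register (16 lanes of 8 bits) and `vadd_u8` on a D register (8 lanes). On any other target it falls back to a plain scalar loop. It says nothing about how many cycles each instruction takes on a given core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Element-wise 8-bit add with wrap-around (mod 256).
   add_u8 is a hypothetical helper, not part of any library. */
static inline void add_u8(const uint8_t *a, const uint8_t *b,
                          uint8_t *out, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && out != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Q-register form: 16 lanes of 8 bits per instruction. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vaddq_u8(va, vb));
    }
    /* D-register form: 8 lanes of 8 bits per instruction. */
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);
        uint8x8_t vb = vld1_u8(b + i);
        vst1_u8(out + i, vadd_u8(va, vb));
    }
#endif

    /* Scalar tail, and the whole loop on non-NEON targets. */
    for (; i < n; ++i) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Whether the Q form actually retires in one cycle or is split into two 64-bit halves is exactly the micro-architecture question being discussed; the source code is the same either way.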

TI's A8 NEON Architecture page gives a nice overview of the stock A8 architecture.

As for the differences between processors, a lot of ARM implementers don't make their architecture details known except to extremely powerful customers, so noting the differences is fairly difficult. But as Stephen Canon notes below, the newer, higher-end A15-ish cores will probably double the performance for some types of instructions, and the lower-power ones will probably halve it for some types of instructions.

Upvotes: 1
