Huy Le

Reputation: 1738

Does phone CPU have separate integer and floating point compute units that can operate in parallel?

On a desktop CPU, interleaving integer and floating-point computation (for example, with float arrays: updating integer indices while computing the array values) is faster than doing all the integer work and then all the float work. This is because integer ops and float ops are handled by different execution units in the CPU, so they can be processed at basically the same time.

Is it the same for newer phones' CPUs and the ARM architecture in general?

Upvotes: 0

Views: 338

Answers (1)

fcdt

Reputation: 2503

Since the x86 architecture has already been discussed in the comments, here is the situation on ARM:

Basically, this also depends on the processor model. Most ARM cores have only two pipelines for SIMD calculations. Some instructions can execute on only one of the two pipelines, but most can use either. This also applies to simple ALU operations such as

  • FADD, FSUB, FMUL for floating-point SIMD
  • ADD, SUB, MUL for integer SIMD

If such an addition already has a (maximum) throughput of 2 instructions per cycle, both pipelines are fully utilized. Simple integer SIMD instructions are therefore just as fast as floating-point ones, and because of this high throughput, no additional speed can be gained by mixing further SIMD or even scalar integer operations onto these pipelines. This assumes, of course, that there are no dependencies between the instructions.

In addition to throughput, the latency of the instructions must also be taken into account: the integer SIMD ADD has a maximum latency of 3 cycles; for the floating-point FADD it is 4 cycles. The scalar (non-SIMD) ADD, on the other hand, has a latency of only one cycle. The latency is the number of cycles after which the result is available at the earliest. If each instruction depends on the result of the previous one, throughput is limited by that latency, and it can be useful to interleave other instructions that use other pipelines, for example the scalar ALU ones.
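The effect of such dependency chains can be sketched in portable C (a hypothetical example; the function names are made up, and the 3-4 cycle latency figures come from the answer above). With one accumulator, every addition must wait for the previous result; splitting the sum into several independent accumulators lets the pipelines stay busy, and both variants still return the same total for these inputs:

```c
/* One dependency chain: each add waits for the previous result,
   so throughput is bounded by the add latency. */
float sum_one_chain(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];                 /* depends on the previous s */
    return s;
}

/* Four independent chains: the adds in one iteration do not
   depend on each other, so they can be in flight simultaneously. */
float sum_four_chains(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)             /* leftover elements */
        s += x[i];
    return s;
}
```

Note that compilers only apply this transformation to floating-point sums themselves under relaxed FP settings (e.g. `-ffast-math`), since it changes the order of the additions.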

At least that's the case with the Cortex-A72 and Cortex-A76. With the older Cortex-A55 it's a bit more complicated. You can find the details in the "Software Optimization Guide" that Arm publishes for each core.

Clarification after some comments: Scalar operations on SIMD registers (using s0 to s31, d0 to d31, etc.) and vector operations on them (v0 to v31) always take place on the two SIMD pipelines. Only operations on general-purpose registers (w0 to w30, wzr, wsp, x0 to x30, xzr, sp) run on the two non-SIMD ALU pipelines I0/I1 and the M pipeline. That's why, in some cases, one of the ALU pipelines I0/I1 is also used for address calculation in SIMD instructions.

Upvotes: 2
