Reputation: 6891
I run a machine-learning-based algorithm on a Raspberry Pi 3 that uses huge arrays of stored coefficients which do not need full float32 precision.
I tried to use half-precision floating point for storing this data to reduce the program's memory (and maybe memory bandwidth) footprint.
The rest of the algorithm stays the same.
Comparing the float32 version with the float16 version, I got a significant performance loss (+33% runtime of my test program) when using __fp16, although the conversion should be supported by the CPU.
I took a look at the assembler output and also created a simple function that just reads a __fp16 value and returns it as float, and it seems that a library function call is used for the conversion (the same function that is called in the actual code).
The Raspberry Pi's CPU should have half-precision hardware support, so I expected to see an instruction loading the data and no performance impact (or even an improvement due to the reduced memory bandwidth requirements).
I am using the following compiler flags:
-O3 -mfp16-format=alternative -mfpu=neon-fp16 -mtune=cortex-a53 -mfpu=neon
Here is the small piece of code and the assembler output for the little test function:
const float test(const Coeff *i_data, int i ){
return (float)(i_data[i]);
}
Using float for Coeff:
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r1, r0, r1, lsl #2 @ tmp118, i_data, i,
vldr.32 s0, [r1] @, *_5
bx lr @
Using __fp16 for Coeff (-mfp16-format=alternative):
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
lsl r1, r1, #1 @ tmp118, i,
push {r4, lr} @
ldrh r0, [r0, r1] @ __fp16 @, *_5
bl __gnu_h2f_alternative @
vmov s0, r0 @,
pop {r4, pc} @
Using __fp16 for Coeff (-mfp16-format=ieee):
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
lsl r1, r1, #1 @ tmp118, i,
push {r4, lr} @
ldrh r0, [r0, r1] @ __fp16 @, *_5
bl __gnu_h2f_ieee @
vmov s0, r0 @,
pop {r4, pc} @
Have I missed something?
Upvotes: 0
Views: 1320
Reputation: 6891
The compiler flag -mfpu=neon overrides the earlier -mfpu=neon-fp16, since -mfpu= can only be specified once.
It was a mistake that the flag was set twice (it was added in a different place in the Makefile).
But since the Raspberry Pi 3 has a VFPv4 unit, which always includes fp16 support, the best choice is -mfpu=neon-vfpv4.
In this case no library calls are generated by the compiler for the conversion.
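As a rough sketch, a bulk conversion helper for the stored coefficients could then be written in plain C (the function name and signature below are made up for illustration, not taken from the question); with the corrected -mfpu setting each per-element cast compiles to a hardware conversion instead of a call to __gnu_h2f_*:
#include <stddef.h>
/* Illustrative helper, not from the original program: converts a block of
   half-precision coefficients to float32. With -mfpu=neon-vfpv4 (or
   neon-fp-armv8) the cast is done by a hardware instruction, no libcall. */
void coeffs_to_float(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
}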
Edit: according to this gist, -mfpu=neon-fp-armv8 -mneon-for-64bits can be used for the Raspberry Pi 3.
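For example, a complete invocation combining the original flags with the corrected FPU selection could look like gcc -O3 -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfp16-format=ieee -c coeffs.c (file name illustrative; pick the -mfp16-format that matches how your data was stored).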
Upvotes: 2
Reputation: 6354
On ARM's site: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0774d/chr1421838476257.html
Note: The __fp16 type is a storage format only. For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float.
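A small illustrative example of that behaviour (variable names are made up):
__fp16 a = 1.5f;
__fp16 b = 0.25f;
float  f = a + b;   /* both operands promoted, addition done in float32 */
__fp16 c = a + b;   /* same float32 addition, result narrowed back to __fp16 */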
Upvotes: 1