Reputation: 6891
I run a machine-learning-based algorithm on a Raspberry Pi 3 that uses huge arrays of stored coefficients which do not need full float32 precision.
I tried to use half-precision floating point for storing this data to reduce the program's memory (and maybe memory bandwidth) footprint.
The rest of the algorithm stays the same.
Comparing the float32 version with the float16 version, I got a significant performance loss (+33% runtime of my test program) when using __fp16, although the conversion should be supported by the CPU.
I took a look at the assembler output and also created a simple function that just reads a __fp16 value and returns it as float, and it seems that a library function call is used for the conversion (the same function that is called in the actual code).
The Raspberry Pi's CPU should have half-precision hardware support, so I expected to see an instruction loading the data and no performance impact (or even an improvement due to the reduced memory bandwidth requirements).
I am using the following compiler flags:
-O3 -mfp16-format=alternative -mfpu=neon-fp16 -mtune=cortex-a53 -mfpu=neon
Here is the small piece of code and the assembler output for the little test function:
const float test(const Coeff *i_data, int i ){
return (float)(i_data[i]);
}
Using float for Coeff:
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r1, r0, r1, lsl #2 @ tmp118, i_data, i,
vldr.32 s0, [r1] @, *_5
bx lr @
Using __fp16 for Coeff (-mfp16-format=alternative):
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
lsl r1, r1, #1 @ tmp118, i,
push {r4, lr} @
ldrh r0, [r0, r1] @ __fp16 @, *_5
bl __gnu_h2f_alternative @
vmov s0, r0 @,
pop {r4, pc} @
Using __fp16 for Coeff (-mfp16-format=ieee):
.align 2
.global test
.syntax unified
.arm
.fpu neon
.type test, %function
test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
lsl r1, r1, #1 @ tmp118, i,
push {r4, lr} @
ldrh r0, [r0, r1] @ __fp16 @, *_5
bl __gnu_h2f_ieee @
vmov s0, r0 @,
pop {r4, pc} @
Have I missed something?
Upvotes: 0
Views: 1320
Reputation: 6891
The compiler flag -mfpu=neon overrides the earlier -mfpu=neon-fp16, since -mfpu= can only be specified once.
It was a mistake that the flag was set twice (it was added in a different place in the Makefile).
But since the Raspberry Pi 3 has a VFPv4 unit, which always includes fp16 support, the best choice is -mfpu=neon-vfpv4.
In this case no library calls are generated by the compiler for the conversion.
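As a rough sketch, a bulk conversion helper for the stored coefficients could then be written in plain C (the function name and signature below are made up for illustration, not taken from the question); with the corrected -mfpu setting each per-element cast compiles to a hardware conversion instead of a call to __gnu_h2f_*:
#include <stddef.h>
/* Illustrative helper, not from the original program: converts a block of
   half-precision coefficients to float32. With -mfpu=neon-vfpv4 (or
   neon-fp-armv8) the cast is done by a hardware instruction, no libcall. */
void coeffs_to_float(const __fp16 *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
}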
Edit: according to this gist, -mfpu=neon-fp-armv8 -mneon-for-64bits can be used for the Raspberry Pi 3.
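For example, a complete invocation combining the original flags with the corrected FPU selection could look like gcc -O3 -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfp16-format=ieee -c coeffs.c (file name illustrative; pick the -mfp16-format that matches how your data was stored).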
Upvotes: 2
Reputation: 6354
On ARM's site: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0774d/chr1421838476257.html
Note: The __fp16 type is a storage format only. For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float.
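A small illustrative example of that behaviour (variable names are made up):
__fp16 a = 1.5f;
__fp16 b = 0.25f;
float  f = a + b;   /* both operands promoted, addition done in float32 */
__fp16 c = a + b;   /* same float32 addition, result narrowed back to __fp16 */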
Upvotes: 1