VLD3.8 in NEON ASM doesn't work like the vld3q_u8 documentation says it should?

Question

According to the ARM reference, there are two intrinsics that load 16 * 3 and 8 * 3 uint8_t elements respectively:

 uint8x16x3_t  vld3q_u8(__transfersize(48) uint8_t const * ptr);  
                                             // VLD3.8 {d0, d2, d4}, [r0]

 uint8x8x3_t vld3_u8(__transfersize(24) uint8_t const * ptr);  
                                             // VLD3.8 {d0, d1, d2}, [r0]

With NEON intrinsics, vld3q_u8 worked as expected and loaded 16 * 3 uint8 elements; however, when I used VLD3.8 {d0, d2, d4} in NEON assembly, only 8 * 3 uint8 elements were loaded.

It seems to me that the d1, d3 and d5 registers weren't used.

I would like to use the q0 (d0, d1), q1 (d2, d3), and q2 (d4, d5) registers fully to load 16 * 3 uint8 elements.

Could anyone help?

// sample code:
vld3.8 {d0, d2, d4}, [%[A]]!
vst3.8 {d0, d2, d4}, [%[C]]!

I am building this for a 32-bit ARM architecture.
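
For reference, here is a minimal sketch of what appears to be happening, written as portable scalar C so it can run without NEON hardware. It assumes the ARMv7 behaviour that a single VLD3.8 with three D registers reads 24 bytes and de-interleaves them, so filling q0, q1 and q2 would take a second VLD3.8 on {d1, d3, d5}. The helper names vld3_8_model and load48_model are hypothetical, made up for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (an assumption, not real NEON code) of
 *     vld3.8 {dA, dB, dC}, [ptr]!
 * It reads 24 bytes, sends byte 3*i+0 to dA, 3*i+1 to dB and 3*i+2 to dC,
 * and returns the pointer advanced by 24, like the "!" writeback.
 * d[n] stands for the 8-byte register dn. */
static const uint8_t *vld3_8_model(uint8_t d[6][8], int a, int b, int c,
                                   const uint8_t *ptr)
{
    assert(ptr != NULL);
    for (int i = 0; i < 8; ++i) {
        d[a][i] = ptr[3 * i + 0];
        d[b][i] = ptr[3 * i + 1];
        d[c][i] = ptr[3 * i + 2];
    }
    return ptr + 24;
}

/* Model of loading 16 * 3 bytes into q0 (d0, d1), q1 (d2, d3), q2 (d4, d5).
 * Under the assumption above this corresponds to the pair
 *     vld3.8 {d0, d2, d4}, [%[A]]!
 *     vld3.8 {d1, d3, d5}, [%[A]]!
 * Returns the pointer advanced by 48. */
static const uint8_t *load48_model(uint8_t d[6][8], const uint8_t *ptr)
{
    ptr = vld3_8_model(d, 0, 2, 4, ptr); /* low halves of q0, q1, q2  */
    ptr = vld3_8_model(d, 1, 3, 5, ptr); /* high halves of q0, q1, q2 */
    return ptr;
}
```

If src holds the bytes 0..47, then after load48_model the model's d[0] holds 0, 3, ..., 21 and d[1] holds 24, 27, ..., 45, so q0 ends up with every third byte starting at offset 0. A single call to vld3_8_model with (0, 2, 4) leaves d[1], d[3] and d[5] untouched, which matches the behaviour described above. The store side would presumably need the same treatment, i.e. a second vst3.8 on {d1, d3, d5}.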
