user1296153

Reputation: 585

VLD3.8 in NEON ASM doesn't work like the vld3q_u8 documentation says it should?

According to the ARM reference, there are two intrinsics that load 16 * 3 and 8 * 3 uint8_t elements respectively:

 uint8x16x3_t  vld3q_u8(__transfersize(48) uint8_t const * ptr);  
                                             // VLD3.8 {d0, d2, d4}, [r0]

 uint8x8x3_t vld3_u8(__transfersize(24) uint8_t const * ptr);  
                                             // VLD3.8 {d0, d1, d2}, [r0]

With NEON intrinsics I tried vld3q_u8 and it worked as expected: 16 * 3 uint8 elements were loaded. However, when I used VLD3.8 {d0, d2, d4} in NEON assembly, only 8 * 3 uint8 elements were loaded.

It seems to me that d1, d3 and d5 registers weren't used.

I would like to fully use the q0 (d0, d1), q1 (d2, d3), and q2 (d4, d5) registers to load 16 * 3 uint8 elements.

Could anyone help?

//sample code:
vld3.8 {d0, d2, d4}, [%[A]]!
vst3.8 {d0, d2, d4}, [%[C]]!

I am building this for a 32-bit ARM architecture.

Upvotes: 1

Views: 715

Answers (1)

Notlikethat

Reputation: 20924

It seems to me that d1, d3 and d5 registers weren't used

Indeed they're not, unless you load them. What the intrinsics reference doesn't make clear is that each Q-form load/store intrinsic expands to two instructions. The underlying vldN/vstN instructions only target D registers, and a single vld3.8 transfers 24 bytes into three of them. The register list can be either consecutive (d0, d1, d2) or spaced with a stride of 2 (d0, d2, d4), so a pair of stride-2 instructions, the second one starting at the odd register and reading the next 24 bytes, fills whole Q registers in the appropriate order.

Here's a disassembled example of what a vld3q_u8 intrinsic actually looks like in situ:

0:   f460650f        vld3.8  {d22,d24,d26}, [r0]
4:   e2802018        add     r2, r0, #24
...
c:   f462750f        vld3.8  {d23,d25,d27}, [r2]
...

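That's targeting a uint8x16x3_t variable for which the compiler has apparently allocated Q11-Q13 (that is, d22-d27): the first vld3.8 fills the low halves (d22, d24, d26) and the second, reading from 24 bytes further on, fills the high halves (d23, d25, d27).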
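
To make the register pairing concrete, here is a minimal scalar sketch in plain C that models the pair of loads applied to the registers from the question (d0-d5). It is not NEON code: the d array and the vld3_8 and read_q helpers are hypothetical names made up for this illustration, and the model only assumes the behaviour described above (each vld3.8 reads 24 bytes, de-interleaves them into three D registers, and with the "!" writeback advances the pointer by 24).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the D register file: 32 registers of 8 bytes.
   On 32-bit ARM NEON, Qn aliases D(2n) (low half) and D(2n+1) (high half). */
static uint8_t d[32][8];

/* Models "vld3.8 {d<a>, d<b>, d<c>}, [ptr]!":
   reads 24 bytes, de-interleaves them into three D registers,
   and returns the pointer advanced by 24 (the "!" writeback). */
static const uint8_t *vld3_8(int a, int b, int c, const uint8_t *ptr)
{
    assert(a >= 0 && a < 32 && b >= 0 && b < 32 && c >= 0 && c < 32);
    for (int lane = 0; lane < 8; ++lane) {
        d[a][lane] = ptr[3 * lane + 0];
        d[b][lane] = ptr[3 * lane + 1];
        d[c][lane] = ptr[3 * lane + 2];
    }
    return ptr + 24;
}

/* Copies the 16 bytes of Qn (D(2n) then D(2n+1)) into out. */
static void read_q(int n, uint8_t out[16])
{
    assert(n >= 0 && n < 16);
    memcpy(out, d[2 * n], 8);
    memcpy(out + 8, d[2 * n + 1], 8);
}
```

Calling vld3_8(0, 2, 4, p) followed by vld3_8(1, 3, 5, p + 24) corresponds to "vld3.8 {d0, d2, d4}, [r0]!" followed by "vld3.8 {d1, d3, d5}, [r0]!". With a 48-byte source holding 0, 1, 2, ..., 47, read_q(0, ...) then yields 0, 3, 6, ..., 45, which matches what val[0] of a vld3q_u8 result would contain. The same pairing applies to the stores: a vst3.8 on {d0, d2, d4} followed by a vst3.8 on {d1, d3, d5}.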

Upvotes: 3
