Reputation: 2135
In an iOS 6 project, I have a buffer containing two-byte words (16 bits) that need to be translated to four-byte words (32 bits) via a lookup table. I hard-code the values into the table, and then use the value of the two-byte word to determine which 32-bit table value to retrieve. Here's an example:
void map_values(uint32_t *dst, uint16_t *src, uint32_t *lut, int buf_length) {
    int i = 0;
    // Each 16-bit source word is used as an index into the 32-bit lookup
    // table; the looked-up value is written to the output buffer.
    for (i = 0; i < buf_length; i++) {
        *dst = *(lut + (*src));
        dst++;
        src++;
    }
}
The problem is, it's too slow. Could this be sped up by processing four output words at a time using NEON? The thing is, I'm unsure how to take a value from the src buffer and use it as an index into the lookup table to figure out which value to retrieve. Also, the word length is the same for the table entries and the output buffer, but not for the source: each value I read from the source is a 16-bit word, while each value I have to write out is a 32-bit word. Any ideas? Is there a better way to approach this problem, perhaps?
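To make the width mismatch concrete, here is a minimal sketch with NEON intrinsics (the function name is just for illustration): vld1_u16 loads four 16-bit indices and vmovl_u16 widens each to a 32-bit lane, but ARMv7 NEON has no gather load, so the table reads themselves still happen lane by lane, and this is not necessarily any faster than the scalar loop.

#include <arm_neon.h>
#include <stdint.h>

/* Sketch only: widen four 16-bit indices to 32 bits with NEON, then do the
   table reads lane by lane because ARMv7 NEON has no gather instruction. */
void map_values_neon_sketch(uint32_t *dst, const uint16_t *src,
                            const uint32_t *lut, int buf_length)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint16x4_t idx16 = vld1_u16(src + i);   /* four 16-bit indices */
        uint32x4_t idx32 = vmovl_u16(idx16);    /* widened to 32 bits  */
        uint32x4_t out;
        out = vdupq_n_u32(lut[vgetq_lane_u32(idx32, 0)]);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 1)], out, 1);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 2)], out, 2);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 3)], out, 3);
        vst1q_u32(dst + i, out);                /* four 32-bit results */
    }
    for (; i < buf_length; i++)                 /* leftover elements   */
        dst[i] = lut[src[i]];
}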
Current asm output from clang (clang -O3 -arch armv7 lut.c -S):
	.section	__TEXT,__text,regular,pure_instructions
	.section	__TEXT,__textcoal_nt,coalesced,pure_instructions
	.section	__TEXT,__const_coal,coalesced
	.section	__TEXT,__picsymbolstub4,symbol_stubs,none,16
	.section	__TEXT,__StaticInit,regular,pure_instructions
	.syntax	unified
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_map_values
	.align	2
	.code	16                      @ @map_values
	.thumb_func	_map_values
_map_values:
@ BB#0:
	cmp	r3, #0
	it	eq
	bxeq	lr
LBB0_1:                                 @ %.lr.ph
                                        @ =>This Inner Loop Header: Depth=1
	ldrh	r9, [r1], #2
	subs	r3, #1
	ldr.w	r9, [r2, r9, lsl #2]
	str	r9, [r0], #4
	bne	LBB0_1
@ BB#2:                                 @ %._crit_edge
	bx	lr
	.subsections_via_symbols
Upvotes: 1
Views: 2016
Reputation: 106287
Lookup tables are (nearly) unvectorizable. Very small lookup tables can be handled with the vtbl instruction, but your lookup table is far too big for that.
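For a sense of scale, a vtbl-class lookup tops out at a 32-byte table of 8-bit entries, along the lines of the sketch below (the function name is illustrative only), whereas a table indexed by a 16-bit value can hold up to 65536 four-byte entries.

#include <arm_neon.h>

/* Illustration only: vtbl4_u8 looks up eight 8-bit indices in a 32-byte
   table in one instruction; indices of 32 or more yield 0.  This is the
   largest table NEON can search in registers, which is why a 16-bit-indexed
   table of 32-bit values cannot be handled this way. */
uint8x8_t tiny_table_lookup(uint8x8x4_t table32, uint8x8_t indices)
{
    return vtbl4_u8(table32, indices);
}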
What are you using the lookup table for? If the values can be computed on the fly without too much work instead of looking them up, that may actually be a significant win for you.
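For instance, if the table happened to encode something as simple as a linear mapping (purely hypothetical, since the question doesn't say what it contains), the lookup disappears entirely and the loop vectorizes cleanly:

#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical case: suppose lut[i] == i * scale + offset.  Then the lookup
   can be replaced by arithmetic, which NEON does four lanes at a time with
   a widening load and a multiply-accumulate. */
void map_values_linear(uint32_t *dst, const uint16_t *src, int buf_length,
                       uint32_t scale, uint32_t offset)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint32x4_t idx = vmovl_u16(vld1_u16(src + i)); /* widen 16 -> 32 bits */
        uint32x4_t out = vmlaq_n_u32(vdupq_n_u32(offset), idx, scale);
        vst1q_u32(dst + i, out);
    }
    for (; i < buf_length; i++)                        /* scalar tail */
        dst[i] = (uint32_t)src[i] * scale + offset;
}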
Upvotes: 3
Reputation: 49386
My first thought is that you might get some luck out of vtablelookup in the vecLib portion of the Accelerate framework. The signature is:
vUInt32 vtablelookup (
    vSInt32   Index_Vect,
    uint32_t *Table
);
where vSInt32 and vUInt32 are 128-bit packed 32-bit signed/unsigned integers respectively. I believe the function is backed by NEON on ARM. The big problem will be converting your src array into 32-bit indices, which could well slow things down so much as to render the speed gains from vectorising the lookup pointless.
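Roughly, that conversion could look like the sketch below, assuming vtablelookup is available in the iOS SDK and that the vecLib vector types can be bridged to NEON values with memcpy (the function name is illustrative only):

#include <Accelerate/Accelerate.h>
#include <arm_neon.h>
#include <string.h>

/* Sketch under the above assumptions: widen four uint16_t indices to 32 bits
   with NEON, bridge the result to vecLib's vSInt32 via memcpy, and let
   vtablelookup perform the four gathers. */
void map_values_veclib(uint32_t *dst, const uint16_t *src,
                       uint32_t *lut, int buf_length)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint32x4_t idx32 = vmovl_u16(vld1_u16(src + i)); /* 16 -> 32 bit */
        vSInt32 indices;
        vUInt32 result;
        memcpy(&indices, &idx32, sizeof indices);        /* type bridge  */
        result = vtablelookup(indices, lut);
        memcpy(dst + i, &result, sizeof result);
    }
    for (; i < buf_length; i++)                          /* scalar tail  */
        dst[i] = lut[src[i]];
}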
Upvotes: 1