Reputation: 2135
In an iOS 6 project, I have a buffer containing two-byte words (16 bits) that need to be translated to four-byte words (32 bits) via a lookup table. I hard-code the values into the table, and then use the value of the two-byte word to determine which 32-bit table value to retrieve. Here's an example:
void map_values(uint32_t *dst, uint16_t *src, uint32_t *lut, int buf_length) {
    int i = 0;
    // Each 16-bit source word is used as an index into the 32-bit lookup
    // table; the looked-up value is written to the output buffer.
    for (i = 0; i < buf_length; i++) {
        *dst = *(lut + (*src));
        dst++;
        src++;
    }
}
The problem is, it's too slow. Could this be sped up by processing four output words at a time using NEON? The thing is, I'm unsure how to take a value from the src buffer and use it as an index into the lookup table to figure out which value to retrieve. Also, the word length is the same for the table entries and the output buffer, but not for the source: each value I read from the source is a 16-bit word, while each value I have to write out is a 32-bit word. Any ideas? Is there a better way to approach this problem, perhaps?
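To make the width mismatch concrete, here is a minimal sketch with NEON intrinsics (the function name is just for illustration): vld1_u16 loads four 16-bit indices and vmovl_u16 widens each to a 32-bit lane, but ARMv7 NEON has no gather load, so the table reads themselves still happen lane by lane, and this is not necessarily any faster than the scalar loop.

#include <arm_neon.h>
#include <stdint.h>

/* Sketch only: widen four 16-bit indices to 32 bits with NEON, then do the
   table reads lane by lane because ARMv7 NEON has no gather instruction. */
void map_values_neon_sketch(uint32_t *dst, const uint16_t *src,
                            const uint32_t *lut, int buf_length)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint16x4_t idx16 = vld1_u16(src + i);   /* four 16-bit indices */
        uint32x4_t idx32 = vmovl_u16(idx16);    /* widened to 32 bits  */
        uint32x4_t out;
        out = vdupq_n_u32(lut[vgetq_lane_u32(idx32, 0)]);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 1)], out, 1);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 2)], out, 2);
        out = vsetq_lane_u32(lut[vgetq_lane_u32(idx32, 3)], out, 3);
        vst1q_u32(dst + i, out);                /* four 32-bit results */
    }
    for (; i < buf_length; i++)                 /* leftover elements   */
        dst[i] = lut[src[i]];
}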
Current asm output from clang (clang -O3 -arch armv7 lut.c -S):
	.section	__TEXT,__text,regular,pure_instructions
	.section	__TEXT,__textcoal_nt,coalesced,pure_instructions
	.section	__TEXT,__const_coal,coalesced
	.section	__TEXT,__picsymbolstub4,symbol_stubs,none,16
	.section	__TEXT,__StaticInit,regular,pure_instructions
	.syntax	unified
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_map_values
	.align	2
	.code	16                      @ @map_values
	.thumb_func	_map_values
_map_values:
@ BB#0:
	cmp	r3, #0
	it	eq
	bxeq	lr
LBB0_1:                                 @ %.lr.ph
                                        @ =>This Inner Loop Header: Depth=1
	ldrh	r9, [r1], #2
	subs	r3, #1
	ldr.w	r9, [r2, r9, lsl #2]
	str	r9, [r0], #4
	bne	LBB0_1
@ BB#2:                                 @ %._crit_edge
	bx	lr
	.subsections_via_symbols
Upvotes: 1
Views: 2016
Reputation: 106287
Lookup tables are (nearly) unvectorizable. Very small lookup tables can be handled with the vtbl instruction, but your lookup table is far too big for that.
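For a sense of scale, a vtbl-class lookup tops out at a 32-byte table of 8-bit entries, along the lines of the sketch below (the function name is illustrative only), whereas a table indexed by a 16-bit value can hold up to 65536 four-byte entries.

#include <arm_neon.h>

/* Illustration only: vtbl4_u8 looks up eight 8-bit indices in a 32-byte
   table in one instruction; indices of 32 or more yield 0.  This is the
   largest table NEON can search in registers, which is why a 16-bit-indexed
   table of 32-bit values cannot be handled this way. */
uint8x8_t tiny_table_lookup(uint8x8x4_t table32, uint8x8_t indices)
{
    return vtbl4_u8(table32, indices);
}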
What are you using the lookup table for? If the values can be computed on the fly without too much work instead of looking them up, that may actually be a significant win for you.
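For instance, if the table happened to encode something as simple as a linear mapping (purely hypothetical, since the question doesn't say what it contains), the lookup disappears entirely and the loop vectorizes cleanly:

#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical case: suppose lut[i] == i * scale + offset.  Then the lookup
   can be replaced by arithmetic, which NEON does four lanes at a time with
   a widening load and a multiply-accumulate. */
void map_values_linear(uint32_t *dst, const uint16_t *src, int buf_length,
                       uint32_t scale, uint32_t offset)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint32x4_t idx = vmovl_u16(vld1_u16(src + i)); /* widen 16 -> 32 bits */
        uint32x4_t out = vmlaq_n_u32(vdupq_n_u32(offset), idx, scale);
        vst1q_u32(dst + i, out);
    }
    for (; i < buf_length; i++)                        /* scalar tail */
        dst[i] = (uint32_t)src[i] * scale + offset;
}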
Upvotes: 3
Reputation: 49386
My first thought is that you might get some luck out of vtablelookup in the vecLib portion of the Accelerate framework. The signature is:
vUInt32 vtablelookup (
    vSInt32   Index_Vect,
    uint32_t *Table
);
where vSInt32 and vUInt32 are 128-bit packed 32-bit signed/unsigned integers respectively. I believe the function is backed by NEON on ARM. The big problem will be converting your src array into 32-bit indices, which could well slow things down so much as to render the speed gains from vectorising the lookup pointless.
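Roughly, that conversion could look like the sketch below, assuming vtablelookup is available in the iOS SDK and that the vecLib vector types can be bridged to NEON values with memcpy (the function name is illustrative only):

#include <Accelerate/Accelerate.h>
#include <arm_neon.h>
#include <string.h>

/* Sketch under the above assumptions: widen four uint16_t indices to 32 bits
   with NEON, bridge the result to vecLib's vSInt32 via memcpy, and let
   vtablelookup perform the four gathers. */
void map_values_veclib(uint32_t *dst, const uint16_t *src,
                       uint32_t *lut, int buf_length)
{
    int i;
    for (i = 0; i + 4 <= buf_length; i += 4) {
        uint32x4_t idx32 = vmovl_u16(vld1_u16(src + i)); /* 16 -> 32 bit */
        vSInt32 indices;
        vUInt32 result;
        memcpy(&indices, &idx32, sizeof indices);        /* type bridge  */
        result = vtablelookup(indices, lut);
        memcpy(dst + i, &result, sizeof result);
    }
    for (; i < buf_length; i++)                          /* scalar tail  */
        dst[i] = lut[src[i]];
}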
Upvotes: 1