Qqwy

Reputation: 5629

How to find the matching element in a tiny array with unique elements as quickly as possible?

Inside the Erlang runtime system, persistent hashmaps are represented as hash-array-mapped-tries if they are big, and 'flatmaps' if they are small.

I recently was nerdsniped into looking for ways to optimize this. ^_^'

A flatmap has the following characteristics:

  • It holds at most 32 key/value pairs.
  • All keys are unique.
  • The keys and values are stored in two separate, parallel arrays.

The current implementation of this is:

uint64_t *original_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
  uint64_t n = max_size;
  uint64_t i;

  for (i = 0; i < n; ++i) {
    if (keys[i] == key) {
      return &vals[i];
    }
  }
  return NULL;
}

(Simplified from the original)

But this does not use the above info at all. So I tried what would happen if the compiler was made aware that the map never holds more than 32 elements.

This led to the following implementation:

uint64_t *latereturn_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
  uint64_t n = max_size < 32 ? max_size : 32; // a flatmap never holds more than 32 entries
  uint64_t i;

  uint64_t *res = NULL;
  for (i = 0; i < n; ++i) {
    if (keys[i] == key) {
      res = &vals[i];
    }
  }
  return res;
}

Looking at Compiler Explorer, we can see that Clang and GCC are now able to vectorize and unroll the loop. Benchmarking this shows a 5-15% speedup.
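
For reference, a minimal timing harness sketch (a hypothetical setup, not the benchmark that produced the numbers above; the key distribution and hit pattern are assumptions):

#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint64_t *latereturn_flatmap_get(uint64_t *keys, uint64_t *vals,
                                 uint64_t key, uint64_t max_size);

int main(void) {
  enum { N = 32, ITERS = 10000000 };
  uint64_t keys[N], vals[N];
  for (uint64_t i = 0; i < N; ++i) {
    keys[i] = (i + 1) * 0x9E3779B97F4A7C15u; // unique keys
    vals[i] = i;
  }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  uint64_t sink = 0;
  for (uint64_t it = 0; it < ITERS; ++it) {
    uint64_t *p = latereturn_flatmap_get(keys, vals, keys[it % N], N);
    sink += p ? *p : 0; // keep the result live so the call is not optimized away
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("%.2f ns per lookup (sink=%llu)\n", ns / ITERS, (unsigned long long)sink);
  return 0;
}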


However, now for the question: Is it possible to go further?

For instance, is it possible to indicate to the compiler somehow that all elements in the array will be unique which might enable even more optimizations?

Or are there maybe ways to manually write some SIMD instructions directly which are even faster?

Upvotes: 2

Views: 338

Answers (3)

amonakov

Reputation: 2409

At 32 elements max, the number of CPU cycles needed to issue the entire loop is comparable to the pipeline depth and the branch misprediction penalty, so vectorized solutions need to be careful about poorly predictable branches in their prologues and epilogues. A microbenchmark that repeatedly invokes the function with the same array size is prone to hiding this aspect, making a size-dependent branch perfectly predictable where it might not be in real usage.
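
For example, a microbenchmark can draw the map size (and the probed key) pseudo-randomly per call so that size-dependent branches stay hard to predict. A rough sketch, with names of my own choosing (flatmap_get stands for whichever implementation is under test):

#include <stdint.h>

uint64_t *flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n);

// Marsaglia xorshift64: cheap pseudo-random numbers for the harness
static uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

// keys/vals: arrays with at least 32 unique entries, filled by the caller
uint64_t bench_mixed_sizes(uint64_t *keys, uint64_t *vals, uint64_t iters)
{
    uint64_t seed = 88172645463325252u, sink = 0;
    for (uint64_t i = 0; i < iters; ++i) {
        uint64_t r = xorshift64(&seed);
        uint64_t n = 1 + (r & 31);                  // size varies in 1..32
        uint64_t *p = flatmap_get(keys, vals, keys[r % n], n);
        sink += p ? *p : 0;                         // keep the result live
    }
    return sink;
}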

I expect the following scalar loop to be preferable to vectorized solutions in actual usage:

#include <stdint.h>

#define unlikely(cond) __builtin_expect((cond), 0)

uint64_t *
scal_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
    if (!n) return 0;

    uint64_t *last = &keys[n-1];
    for (uint64_t *k = keys; k < last; k += 2) {
#if defined(__aarch64__)
        // Show the compiler that k[1] can be loaded together with k[0]
        // so they can be combined in a load-pair (ldp) instruction
        uint64_t k1 = k[1], *pk1 = &k1;
#else
        uint64_t *pk1 = &k[1];
#endif
        if (unlikely(k[0] == key)) {
            // Prevent hoisting below computations in the loop
            asm("" : "+r"(k));
            return &vals[k-keys];
        }
        if (unlikely(*pk1 == key)) {
            asm("" : "+r"(k));
            return &vals[k+1-keys];
        }
    }
    return *last == key ? &vals[n-1] : 0;
}

On x86-64 the loop comprises the following instructions (view on Compiler Explorer):

.L6:
        cmp     QWORD PTR [rax], rdx
        je      .L11
        cmp     QWORD PTR [rax+8], rdx
        je      .L12
        add     rax, 16
        cmp     rax, rcx
        jb      .L6

On popular x86 implementations one iteration of this loop can be issued in one cycle. The execution port(s) at which branch instructions can execute are the bottleneck, limiting it to between 1.5 and 3 cycles per two elements (roughly 24-48 cycles for a full 32-element scan).


Alternatives that may look better in microbenchmarks:

The greedy switch

This one shines when the initial branch that depends on n is perfectly predictable.

uint64_t *
switch_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
  switch (n-1) {
#define CASE(N) \
    case N: if (keys[N] == key) return &vals[N]; /* falls through to case N-1 */
    CASE(31) CASE(30) CASE(29) CASE(28) CASE(27) CASE(26) CASE(25) CASE(24)
    CASE(23) CASE(22) CASE(21) CASE(20) CASE(19) CASE(18) CASE(17) CASE(16)
    CASE(15) CASE(14) CASE(13) CASE(12) CASE(11) CASE(10) CASE(9)  CASE(8)
    CASE(7)  CASE(6)  CASE(5)  CASE(4)  CASE(3)  CASE(2)  CASE(1)  CASE(0)
    default: return 0;
  }
}

GCC generic vectors showcase

The same source compiles to reasonable AVX and ARMv8 NEON assembly (view on Compiler Explorer):

// Poor man's 'branchless select'
static uintptr_t sel(int cond, uintptr_t p, uintptr_t q)
{
  asm("" : "+r"(p), "+r"(q));
  q = cond ? p : q;
  asm("" : "+r"(p), "+r"(q));
  return q;
}

uint64_t *
vec_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
  if (!n) return 0;

  uintptr_t res = sel(keys[n-1] == key, (uintptr_t)&vals[n-1],  0);

  if (n == 1) return (void *)res;

  res = sel(keys[(n|1)-2] == key, (uintptr_t)&vals[(n|1)-2], res);
  res = sel(keys[(n|1)-3] == key, (uintptr_t)&vals[(n|1)-3], res);

  typedef uint64_t u64v2 __attribute__((vector_size(16)));
  typedef u64v2 u64v2_u  __attribute__((aligned(8),may_alias));

  u64v2_u *vkeys = (void *)keys;
  u64v2 vvals01 = { 0,  8  }; vvals01 += (uintptr_t)vals;
  u64v2 vvals23 = { 16, 24 }; vvals23 += (uintptr_t)vals;
  u64v2 vkey = { key, key };
  u64v2 vres = { 0 };

  for (; vkeys != (void*)&keys[n&-4]; vkeys += 2) {
    u64v2 v01 = vkeys[0] == vkey, v23 = vkeys[1] == vkey;
    vres |= vvals01 & v01;
    vres |= vvals23 & v23;
    vvals01 += 32;
    vvals23 += 32;
  }

  res |= vres[0] | vres[1];

  return (void *)res;
}
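
A quick sanity check for the index arithmetic above (hypothetical test code, not part of the answer): cross-check every size from 0 to 32, a hit at every position, and a miss.

#include <assert.h>
#include <stdint.h>

void selftest(void)
{
    uint64_t keys[32], vals[32];
    for (uint64_t i = 0; i < 32; ++i) {
        keys[i] = (i + 1) * 0x9E3779B97F4A7C15u;  // unique, never equal to 42
        vals[i] = 1000 + i;
    }
    for (uint64_t n = 0; n <= 32; ++n) {
        for (uint64_t i = 0; i < n; ++i)
            assert(vec_flatmap_get(keys, vals, keys[i], n) == &vals[i]);
        assert(vec_flatmap_get(keys, vals, 42, n) == 0);  // missing key
    }
}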

Upvotes: 2

Fra93

Reputation: 2082

I have an idea that goes this way:

The idea is to get rid of the branch itself, which, due to branch misprediction on all newer processors (anything from the last 15 years), can waste a lot of cycles and hurt the performance of this function.

What I want is something that gets executed on all keys, so that the results are all mixed together but still indicate where our match is.

So the pseudo-idea-code is

res = 0 // Init to some neutral value
res = res <op> f(keys,key) // do an operation to mix the results with something that is function of the "search" method

Elaborating further:

  • I know that when I do a bitwise XOR between a number and itself I get all zeros, while in all other cases I get a non-zero value.
  • I also know that XORing with 1 toggles a bit.
  • Processors also have an instruction to count the number of ones, so I can use that to reduce every non-zero number to a 1, without using any if.
  • The assumption is that running 32 operations every time is faster than checking the conditional code.
  • Another assumption is that bitwise operations are easily convertible into SIMD instructions and are faster on any processor.

Something that has these properties is the following:

(1^reduce_or(keys[i]^key)) << i

keys[i]^key -> gives 0 if the key matches, a non-zero value otherwise
reduce_or   -> gives 0 if the key matches, 1 otherwise
1^          -> gives 1 if the key matches, 0 otherwise
<< i        -> moves the 1 to the bit position at which the key was matched

So the final idea:

res = 0
res = res | (1^reduce_or(keys[0]^key)) << 0;
res = res | (1^reduce_or(keys[1]^key)) << 1;
res = res | (1^reduce_or(keys[2]^key)) << 2;
res = res | (1^reduce_or(keys[3]^key)) << 3;
res = res | (1^reduce_or(keys[4]^key)) << 4;
res = res | (1^reduce_or(keys[5]^key)) << 5;
...
res = res | (1^reduce_or(keys[31]^key)) << 31;

After this pass we should have a number like 000000100000000, where the single 1 bit sits at the index at which the key was found.

We still need to:

  • Get an integer from the one-hot encoded number.
  • Get the address.

To go from the one-hot encoded number to the position, it is just a log2 of the result. However, this could be slow, and I don't really have a solution for it. Maybe, instead of shifting by i, we can multiply by i: this gives us 0 for every non-matching key, and i, which is already the index, for the matching key.

res = 0
res = res + (1^reduce_or(keys[0]^key)) * 0;
res = res + (1^reduce_or(keys[1]^key)) * 1;
res = res + (1^reduce_or(keys[2]^key)) * 2;
res = res + (1^reduce_or(keys[3]^key)) * 3;
res = res + (1^reduce_or(keys[4]^key)) * 4;
res = res + (1^reduce_or(keys[5]^key)) * 5;
...
res = res + (1^reduce_or(keys[31]^key)) * 31;

We should test if doing the log is faster than doing sums and multiplications.

Getting the address is then just a matter of pointer arithmetic:

addr = vals+res
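
Putting the pieces together, a minimal C sketch of the one-hot variant (the function and helper names are mine; a count-trailing-zeros builtin stands in for the log2 step, assuming GCC/Clang):

#include <stdint.h>
#include <stddef.h>

// reduce_or(x): 0 if x == 0, 1 otherwise -- i.e. (x != 0)
static inline uint32_t reduce_or(uint64_t x) { return x != 0; }

uint64_t *branchless_flatmap_get(const uint64_t *keys, uint64_t *vals,
                                 uint64_t key, uint64_t n)
{
    uint32_t res = 0;
    for (uint64_t i = 0; i < n; ++i) {
        // (1 ^ reduce_or(...)) is 1 exactly when keys[i] == key;
        // shifting by i builds the one-hot match mask
        res |= (1u ^ reduce_or(keys[i] ^ key)) << i;
    }
    if (res == 0)
        return NULL;                      // key not present
    // one-hot -> index: count trailing zeros instead of a log2
    return &vals[__builtin_ctz(res)];
}

The per-element work here is branch-free; the only remaining branches are the loop itself and the final "not found" check.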

Anyway, this should give branchless code :) I am curious to see whether it will be faster or not!

Upvotes: 2

Soonts

Reputation: 21936

I’m not sure how much faster it’s going to get, if at all, but here’s a manually vectorized AVX2 version of your function.

#include <stdint.h>
#include <immintrin.h>

uint64_t* flatmap_avx2( const uint64_t* keys, uint64_t* vals, uint64_t key, uint64_t max_size )
{
    const __m256i needle = _mm256_set1_epi64x( (int64_t)key );

    const uint64_t* const keysEnd = keys + max_size;
    const uint64_t* const keysEndAligned = keys + ( max_size / 4 ) * 4;

    for( ; keys < keysEndAligned; keys += 4, vals += 4 )
    {
        __m256i src = _mm256_loadu_si256( ( const __m256i* )keys );
        __m256i eq = _mm256_cmpeq_epi64( needle, src );
        uint32_t mask = (uint32_t)_mm256_movemask_epi8( eq );
        if( 0 == mask )
            continue;
        uint32_t byteIndex = _tzcnt_u32( mask );
        // The byte index is a multiple of 8; assembly addresses are expressed in bytes,
        // but pointer arithmetic in C adds elements, not bytes, hence the cast
        return (uint64_t*)( ( (uint8_t*)vals ) + byteIndex );
    }

    for( ; keys < keysEnd; keys++, vals++ )
        if( *keys == key )
            return vals;

    return nullptr;
}

If you're building this with VC++, ideally add #pragma loop( no_vector ) before the second for loop in the function.

Similarly, if you’re building with gcc or clang, ideally add __attribute__((optimize("no-tree-vectorize"))) before the whole function.

Without these compiler-specific shenanigans, compilers may decide to automatically vectorize the second for loop with the remainder, inflating the code for no good reason.

Another performance-related thing: if you can, align your keys pointer to 32 bytes; the loads will become slightly faster.
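
For example, one way to get that alignment when allocating the key array (assuming C11 aligned_alloc is available; the rounding keeps the size a multiple of the alignment, as aligned_alloc requires):

#include <stdint.h>
#include <stdlib.h>

// 32-byte-aligned storage for the keys; with this guarantee, the
// _mm256_loadu_si256 in the main loop can be replaced with _mm256_load_si256
uint64_t* alloc_aligned_keys( uint64_t max_size )
{
    size_t bytes = ( max_size * sizeof( uint64_t ) + 31 ) & ~(size_t)31;
    return (uint64_t*)aligned_alloc( 32, bytes );
}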

Upvotes: 5
