Reputation: 5629
Inside the Erlang runtime system, persistent hashmaps are represented as hash-array-mapped-tries if they are big, and 'flatmaps' if they are small.
I recently was nerdsniped into looking for ways to optimize this. ^_^'
A flatmap has the following characteristics:

- It contains at most 32 elements.
- The keys and the values are stored in two separate uint64_t arrays.
- A lookup only has to compare the uint64_t's together to check for a match.

The current implementation of this is:
uint64_t *original_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
    uint64_t n = max_size;
    uint64_t i;
    for (i = 0; i < n; ++i) {
        if (keys[i] == key) {
            return &vals[i];
        }
    }
    return NULL;
}
(Simplified from the original)
But this does not use the above info at all. I tried what happens if the compiler is made aware that max_size can never be larger than 32 and that the early return can be postponed until after the loop has finished.
This led to the following implementation:
uint64_t *latereturn_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
    uint64_t n = max_size < 32 ? max_size : 32; /* i.e. min(max_size, 32) */
    uint64_t i;
    uint64_t *res = NULL;
    for (i = 0; i < n; ++i) {
        if (keys[i] == key) {
            res = &vals[i];
        }
    }
    return res;
}
Looking at Compiler Explorer we can see that Clang and GCC are able to vectorize and unroll the loop now. Benchmarking this shows a 5-15% speedup.
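As a side note (not part of the Erlang code, just a sketch I have not benchmarked): on GCC and Clang the same "at most 32 elements" fact can also be expressed without clamping, via __builtin_unreachable():

uint64_t *hinted_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
    if (max_size > 32)
        __builtin_unreachable(); /* promise to the compiler: flatmaps never exceed 32 entries */
    uint64_t *res = NULL;
    for (uint64_t i = 0; i < max_size; ++i) {
        if (keys[i] == key) {
            res = &vals[i];
        }
    }
    return res;
}

Whether this enables the same vectorization as the explicit clamp depends on the compiler version, so it would also need to be checked on Compiler Explorer.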
However, now for the question: Is it possible to go further?
For instance, is it possible to indicate to the compiler somehow that all elements in the array will be unique, which might enable even more optimizations?
Or are there maybe ways to manually write some SIMD instructions directly which are even faster?
Upvotes: 2
Views: 338
Reputation: 2409
At 32 elements max, the number of CPU cycles needed to issue the entire loop is comparable to the pipeline depth and the branch misprediction penalty, so vectorized solutions need to be careful about poorly predictable branches in their prologues and epilogues. A microbenchmark that repeatedly invokes the function with the same array size is prone to hiding this aspect, making a size-dependent branch perfectly predictable where it might not be in real usage.
I expect the following scalar loop to be preferable to vectorized solutions in actual usage:
#include <stdint.h>
#define unlikely(cond) __builtin_expect((cond), 0)
uint64_t *
scal_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
    if (!n) return 0;
    uint64_t *last = &keys[n-1];
    for (uint64_t *k = keys; k < last; k += 2) {
#if defined(__aarch64__)
        // Show the compiler that k[1] can be loaded together with k[0]
        // so they can be combined in a load-pair (ldp) instruction
        uint64_t k1 = k[1], *pk1 = &k1;
#else
        uint64_t *pk1 = &k[1];
#endif
        if (unlikely(k[0] == key)) {
            // Prevent hoisting below computations in the loop
            asm("" : "+r"(k));
            return &vals[k-keys];
        }
        if (unlikely(*pk1 == key)) {
            asm("" : "+r"(k));
            return &vals[k+1-keys];
        }
    }
    return *last == key ? &vals[n-1] : 0;
}
On x86-64 the loop comprises the following instructions (view on Compiler Explorer):
.L6:
cmp QWORD PTR [rax], rdx
je .L11
cmp QWORD PTR [rax+8], rdx
je .L12
add rax, 16
cmp rax, rcx
jb .L6
On popular x86 implementations one iteration of this loop can be issued in one cycle. The execution port(s) at which branch instructions can execute are the bottleneck, limiting throughput to between 1.5 and 3 cycles per two elements.
Alternatives that may look better in microbenchmarks:
This one shines when the initial branch that depends on n is perfectly predictable.
uint64_t *
switch_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
    switch (n-1) {
#define CASE(N) \
    case N: if (keys[N] == key) return &vals[N];
    CASE(31) CASE(30) CASE(29) CASE(28) CASE(27) CASE(26) CASE(25) CASE(24)
    CASE(23) CASE(22) CASE(21) CASE(20) CASE(19) CASE(18) CASE(17) CASE(16)
    CASE(15) CASE(14) CASE(13) CASE(12) CASE(11) CASE(10) CASE(9) CASE(8)
    CASE(7) CASE(6) CASE(5) CASE(4) CASE(3) CASE(2) CASE(1) CASE(0)
    default: return 0;
    }
}
The same source compiles to reasonable AVX and ARMv8 NEON assembly (view on Compiler Explorer):
// Poor man's 'branchless select'
static uintptr_t sel(int cond, uintptr_t p, uintptr_t q)
{
    asm("" : "+r"(p), "+r"(q));
    q = cond ? p : q;
    asm("" : "+r"(p), "+r"(q));
    return q;
}

uint64_t *
vec_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n)
{
    if (!n) return 0;
    uintptr_t res = sel(keys[n-1] == key, (uintptr_t)&vals[n-1], 0);
    if (n == 1) return (void *)res;
    res = sel(keys[(n|1)-2] == key, (uintptr_t)&vals[(n|1)-2], res);
    res = sel(keys[(n|1)-3] == key, (uintptr_t)&vals[(n|1)-3], res);
    typedef uint64_t u64v2 __attribute__((vector_size(16)));
    typedef u64v2 u64v2_u __attribute__((aligned(8),may_alias));
    u64v2_u *vkeys = (void *)keys;
    u64v2 vvals01 = { 0, 8 };   vvals01 += (uintptr_t)vals;
    u64v2 vvals23 = { 16, 24 }; vvals23 += (uintptr_t)vals;
    u64v2 vkey = { key, key };
    u64v2 vres = { 0 };
    for (; vkeys != (void*)&keys[n&-4]; vkeys += 2) {
        u64v2 v01 = vkeys[0] == vkey, v23 = vkeys[1] == vkey;
        vres |= vvals01 & v01;
        vres |= vvals23 & v23;
        vvals01 += 32;
        vvals23 += 32;
    }
    res |= vres[0] | vres[1];
    return (void *)res;
}
Upvotes: 2
Reputation: 2082
I have an idea that goes this way:
The idea here is to get rid of the branch itself, which, due to branch misprediction on all modern processors (anything from the last 15 years), could waste a lot of cycles; removing it could bump the performance of this function.
What I want is something that gets executed on all keys, such that the results get mixed together but still indicate where our match is.
So the pseudo-idea-code is
res = 0 // Init to some neutral value
res = res <op> f(keys, key) // do an operation to mix the results with something that is a function of the "search"
Elaborating further:

- XOR between a number and itself gives all zeros, while in all other cases it gives a non-zero value.
- XORing with 1 toggles a bit.
- We do not want any if.

Something that has these properties is the following:

(1 ^ reduce_or(keys[i] ^ key)) << i

keys[i] ^ key -> gives 0 if the key matches, some non-zero value otherwise
reduce_or -> gives 0 if the key matches, 1 otherwise
1 ^ -> gives 1 if the key matches, 0 otherwise
<< i -> moves the 1 to the bit position at which the key was matched
So the final idea:
res = 0
res = res | (1 ^ reduce_or(keys[0] ^ key)) << 0;
res = res | (1 ^ reduce_or(keys[1] ^ key)) << 1;
res = res | (1 ^ reduce_or(keys[2] ^ key)) << 2;
res = res | (1 ^ reduce_or(keys[3] ^ key)) << 3;
res = res | (1 ^ reduce_or(keys[4] ^ key)) << 4;
res = res | (1 ^ reduce_or(keys[5] ^ key)) << 5;
...
res = res | (1 ^ reduce_or(keys[31] ^ key)) << 31;
After this pass we should have a number like 000000100000000, where the single 1 bit is at the index at which the key was found.
We still need to:

- turn the one-hot encoded number into the index of the match, and
- turn that index into the address of the corresponding value.

To go from the one-hot encoded number to the position, it is just the log2 of the result (the position of the set bit). However, this could be slow, and I don't really have a solution for it. Maybe, instead of shifting by i, we can multiply by i: this gives 0 when the key does not match and i, that is, the index itself, when it is the right key.
res = 0
res = res + (1 ^ reduce_or(keys[0] ^ key)) * 0;
res = res + (1 ^ reduce_or(keys[1] ^ key)) * 1;
res = res + (1 ^ reduce_or(keys[2] ^ key)) * 2;
res = res + (1 ^ reduce_or(keys[3] ^ key)) * 3;
res = res + (1 ^ reduce_or(keys[4] ^ key)) * 4;
res = res + (1 ^ reduce_or(keys[5] ^ key)) * 5;
...
res = res + (1 ^ reduce_or(keys[31] ^ key)) * 31;
We should test if doing the log is faster than doing sums and multiplications.
To get the address is just a matter of pointer arithmetic:
addr = vals+res
Anyway, this should give branchless code :) I am curious to see if it will be faster or not!
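For reference, here is a minimal C sketch of this idea (my own translation of the pseudocode above, not benchmarked). It uses the shift/one-hot variant and GCC/Clang's __builtin_ctz in place of the log2 step:

#include <stddef.h>
#include <stdint.h>

/* Branchless sketch of the one-hot idea above; assumes max_size <= 32 and
   unique keys, so at most one bit of the mask can be set. */
uint64_t *onehot_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
    uint32_t mask = 0;
    for (uint64_t i = 0; i < max_size; ++i) {
        /* (1 ^ reduce_or(keys[i] ^ key)): 1 on a match, 0 otherwise */
        uint32_t hit = (uint32_t)(keys[i] == key);
        mask |= hit << i;
    }
    if (mask == 0)
        return NULL;
    /* the "log2" step: index of the single set bit */
    return &vals[__builtin_ctz(mask)];
}

Whether the compiler actually turns the loop body into branch-free (and possibly vectorized) code would still have to be checked on Compiler Explorer.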
Upvotes: 2
Reputation: 21936
I’m not sure how much faster it’s going to get, if at all, but here’s a manually vectorized AVX2 version of your function.
#include <immintrin.h>
#include <stdint.h>

uint64_t* flatmap_avx2( const uint64_t* keys, uint64_t* vals, uint64_t key, uint64_t max_size )
{
    const __m256i needle = _mm256_set1_epi64x( (int64_t)key );
    const uint64_t* const keysEnd = keys + max_size;
    const uint64_t* const keysEndAligned = keys + ( max_size / 4 ) * 4;
    for( ; keys < keysEndAligned; keys += 4, vals += 4 )
    {
        __m256i src = _mm256_loadu_si256( ( const __m256i* )keys );
        __m256i eq = _mm256_cmpeq_epi64( needle, src );
        uint32_t mask = (uint32_t)_mm256_movemask_epi8( eq );
        if( 0 == mask )
            continue;
        uint32_t byteIndex = _tzcnt_u32( mask );
        // The index is a multiple of 8; in assembly all addresses are expressed in bytes,
        // but adding to a pointer in C advances by elements, not bytes, hence the cast
        return (uint64_t*)( ( (uint8_t*)vals ) + byteIndex );
    }
    for( ; keys < keysEnd; keys++, vals++ )
        if( *keys == key )
            return vals;
    return nullptr;
}
If you're building this with VC++, ideally add #pragma loop( no_vector ) before the second for loop in the function.
Similarly, if you’re building with gcc or clang, ideally add __attribute__((optimize("no-tree-vectorize"))) before the whole function.
Without these compiler-specific shenanigans, compilers may decide to automatically vectorize the second for loop with the remainder, inflating the code for no good reason.
Another performance-related thing: if you can, align your keys pointer by 32 bytes; the function will become slightly faster.
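For instance, if the keys pointer is guaranteed to be 32-byte aligned, the unaligned load in the loop above could also be replaced with an aligned one (just a sketch of the changed line; whether it measurably helps depends on the CPU):

// Assumes keys is 32-byte aligned; the loop advances keys by 32 bytes per
// iteration, so every load in it then stays aligned.
__m256i src = _mm256_load_si256( ( const __m256i* )keys );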
Upvotes: 5