Reputation: 5388
Background
I've recently been taking some old code (~1998) and rewriting some of it to improve performance. Previously, the basic data structures for a state stored elements in several arrays; now I'm using raw bits (for the cases that require fewer than 64 bits). That is, before I had an array of b elements, and now I have b bits set in a single 64-bit integer that indicate whether each value is part of my state.
Using intrinsics like _pext_u64 and _pdep_u64 I've managed to get all operations 5-10x faster. I am working on the last operation, which has to do with computing a perfect hash function. The exact details of the hash function aren't too important, but it boils down to computing binomial coefficients (n choose k, i.e. n!/((n-k)!k!)) for various n and k. My current code uses a large lookup table for this, which is probably hard to speed up significantly on its own (except for possible cache misses in the table, which I haven't measured).
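For reference, that table is conceptually just a precomputed 2D array of C(n, k) values; a rough sketch of one possible flat layout (the names and layout here are illustrative, not my actual code) is:

#include <cstdint>
#include <vector>

// Illustrative flat layout: entry [k*maxN + n] holds C(n, k).
// Fine for the small n/k ranges involved here; maxN/maxK are made up for this sketch.
std::vector<int32_t> BuildChooseTable(int maxN, int maxK)
{
    std::vector<int32_t> table((maxK + 1) * maxN, 0);
    for (int n = 0; n < maxN; n++)
    {
        table[0 * maxN + n] = 1;  // C(n, 0) = 1
        for (int k = 1; k <= maxK && k <= n; k++)
        {
            // Pascal's rule: C(n, k) = C(n-1, k-1) + C(n-1, k)
            table[k * maxN + n] = table[(k - 1) * maxN + (n - 1)] + table[k * maxN + (n - 1)];
        }
    }
    return table;
}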
But, I was thinking that with SIMD instructions I might be able to directly compute these for several states in parallel, and thus see an overall performance boost.
Some constraints:

- There are b bits set in each 64-bit state (representing small numbers).
- The k value in the binomial coefficients is related to b and changes uniformly in the calculation. These values are small (most of the time <= 5).

So, I can fairly easily write out the math for doing this in parallel, keeping all operations as integer multiply/divide without remainders while staying within 32 bits. The overall flow is to perform the n choose k computation in a way that avoids overflow (see the sketch below).

But, I haven't written SIMD code before, so I'm still getting up to speed on all the functions available and their caveats/efficiencies.
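Here is the kind of scalar computation I mean (a sketch only; multiplying before dividing keeps every step exact, and with my small k the intermediates fit in 32 bits):

#include <cstdint>

// C(n, k) computed iteratively; (result * (n - i)) is always divisible by (i + 1),
// so every division is exact and no remainders appear.
uint32_t Choose(uint32_t n, uint32_t k)
{
    if (k > n) return 0;
    uint32_t result = 1;
    for (uint32_t i = 0; i < k; i++)
    {
        // After this step result == C(n, i+1).
        result = result * (n - i) / (i + 1);
    }
    return result;
}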
Example:
Previously I would have had my data in an array, supposing there are always 5 elements:
[3 7 19 31 38]
Now I'm using a single 64-bit value for this:
0x4080080088
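As an illustrative sketch (not my actual code), the mapping between the array form and the bit form looks roughly like this:

#include <cstdint>
#include <immintrin.h>  // _tzcnt_u64, _blsr_u64 (BMI1)

// Pack element values (each < 64) into a single 64-bit state.
uint64_t PackState(const int *elements, int count)
{
    uint64_t state = 0;
    for (int i = 0; i < count; i++)
        state |= 1ull << elements[i];
    return state;
}

// Unpack the set bits back into element values, lowest first.
int UnpackState(uint64_t state, int *elements)
{
    int count = 0;
    while (state != 0)
    {
        elements[count++] = (int)_tzcnt_u64(state);  // position of the lowest set bit
        state = _blsr_u64(state);                    // clear the lowest set bit
    }
    return count;
}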
This makes many other operations very efficient. For the perfect hash I need to compute something like this efficiently (using c for choose):
(50c5)-(38c5) + (37c4)-(31c4) + (30c3)-(19c3) + ...
But, in practice I have a bunch of these to compute, just with slightly different values:
(50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...
All the X/Y/Z... will be different but the form of the calculation is identical for each.
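For reference, a scalar version of that calculation would look roughly like this (a sketch only; Choose() is the computation or table lookup sketched above, and I'm assuming the element values are available sorted in descending order and that the pattern continues down to k = 1):

#include <cstdint>

uint32_t Choose(uint32_t n, uint32_t k);  // as sketched above, or a table lookup

// Scalar form of the hash above: elements[] holds the values sorted descending
// (e.g. {38, 31, 19, 7, 3}), count is the number of elements, and maxVal is the
// constant in the first term (50 in the example).
int64_t PerfectHashScalar(const int *elements, int count, int maxVal)
{
    int64_t hash = 0;
    int n = maxVal;                      // upper value of the current pair: 50, X-1, Y-1, ...
    for (int k = count; k >= 1; k--)
    {
        int x = elements[count - k];     // next-largest remaining element value
        hash += (int64_t)Choose(n, k) - (int64_t)Choose(x, k);
        n = x - 1;                       // upper value for the next term
    }
    return hash;
}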
Questions:

1. Is my intuition about gaining efficiency by converting to SIMD operations reasonable? (Some sources suggest "no", but that's for the problem of computing a single coefficient, not for doing several in parallel.)
2. Is there something more efficient than repeated _tzcnt_u64 calls for extracting bits into the data structures for SIMD operations? (For instance, I could temporarily break my 64-bit state representation into 32-bit chunks if it would help, but then I wouldn't be guaranteed to have the same number of bits set in each element.)
3. What are the best intrinsics for computing several sequential multiply/divide operations for the binomial coefficients when I know there won't be overflow? (When I look through the Intel references I have trouble interpreting the naming quickly when going through all the variants - it isn't clear that what I want is available.)
4. If directly computing the coefficients is unlikely to be efficient, can SIMD instructions be used for parallel lookups into my previous lookup table of coefficients?
(I apologize for putting several questions together, but given the specific context, I thought it was better to ask them as one.)
Upvotes: 3
Views: 250
Reputation: 5388
Here is one possible solution that does the computation from a lookup table, handling one state at a time. It would probably be more efficient to do this over several states in parallel rather than within a single state. Note: this is hard-coded for the fixed case of choosing combinations of 6 elements.
int64_t GetPerfectHash2(State &s)
{
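    // boardSize, numItemsSet, and the flat combinations[] table of binomial
    // coefficients come from the surrounding code (not shown here).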
    // 6 values will be used
    __m256i offsetsm1 = _mm256_setr_epi32(6*boardSize-1, 5*boardSize-1,
                                          4*boardSize-1, 3*boardSize-1,
                                          2*boardSize-1, 1*boardSize-1, 0, 0);
    __m256i offsetsm2 = _mm256_setr_epi32(6*boardSize-2, 5*boardSize-2,
                                          4*boardSize-2, 3*boardSize-2,
                                          2*boardSize-2, 1*boardSize-2, 0, 0);
    int32_t index[9];
    uint64_t value = _pext_u64(s.index2, ~s.index1);
    index[0] = boardSize-numItemsSet+1;
    for (int x = 1; x < 7; x++)
    {
        index[x] = boardSize-numItemsSet-_tzcnt_u64(value);
        value = _blsr_u64(value);
    }
    index[8] = index[7] = 0;
    // Load values and get index in table
    __m256i firstLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[0]), offsetsm2);
    __m256i secondLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[1]), offsetsm1);
    // Lookup in table
    __m256i values1 = _mm256_i32gather_epi32(combinations, firstLookup, 4);
    __m256i values2 = _mm256_i32gather_epi32(combinations, secondLookup, 4);
    // Subtract the terms
    __m256i finalValues = _mm256_sub_epi32(values1, values2);
    _mm256_storeu_si256((__m256i*)index, finalValues);
    // Extract out final sum
    int64_t result = 0;
    for (int x = 0; x < 6; x++)
    {
        result += index[x];
    }
    return result;
}
Note that I actually have two similar cases. In the first case I don't need the _pext_u64 and this code is ~3x slower than my existing code. In the second case I need it, and it is 25% faster.
Upvotes: 0