Reputation: 22542
I am trying to vectorize a logical validation problem to run on Intel 64.
I will first try to describe the problem:
I have a static array v[]
of 70-bit integers (appx 400,000 of them) which are all known at compile time.
A producer creates 70-bit integers a
, a lot of them, very quickly.
For each a
I need to find out if there exists an element from v
for which v[i] & a == 0
.
So far my implementation in C is something like this (simplified):
for (; *v; v++) {
if (!(a & *v))
return FOUND;
}
// a had no matching element in v
return NOT_FOUND;
I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a
and *v
into an XMM
register each and calling the PTEST
instruction to do the validation.
I am wondering if there is a way to expand this to use all 256 bits of the new YMM
registers?
Maybe packing 3x70 bits into a single register?
I can't quite figure out though how to pack/unpack them efficient enough to justify not just using one register per test.
A couple things that we know about the nature of the input:
v[]
have very few bits setv[]
in any way to make it use less then 70 bitsFOUND
condition is expected to be satisfied after checking appx 20% on v[]
on average.a
before checking them in a batch.v[]
matched, only that one did or not.a
requires very little memory, so anything left in L1 from the previous call is likely to still be there.The resulting code is intended to be ran on the newest generation of Intel Xeon processors supporting SSE4.2, AVX instructions. I will be happy to accept assembly or C that compiles with Intel C compiler or at least GCC.
Upvotes: 3
Views: 339
Reputation: 16439
This sounds like you what you really need is a better data structure to store the v[], so that searches take less than linear time.
Consider that if (v[0] & v[1]) & a
is not zero, then neither (v[0] & a)
nor (v[1] & a)
can be zero. This means it is possible to create a tree structure where the v[] are the leaves, and the parent nodes are the AND combination of their children. Then, if parentNode & a
gives you a non-zero value, you can skip looking at the children.
However, this isn't necessarily helpful - the parent node only ends up testing the bits common between the children, so if there are only a few of those, you still end up testing lots of leave nodes. But if you can find clusters in your data set and group many similar v[] under a common parent, this may drastically reduce the number of comparisons you have to do.
On the other hand, such a tree search involves a lot of conditional branches (expensive), and would be hard to vectorize. I'd first try if you can get away with just two levels: first do a vectorized search among the cluster parent nodes, then for each match do a search for the entries in that cluster.
Actually here's another idea, to help with the fact that 70 bits don't fit well into registers: You could split v[] into 64 (=2^6) different arrays. Of the 70 bits in the original v[], the 6 most significant bits are used to determine which array will contain the value, and only the remaining 64 bits are actually stored in the array.
By testing the mask a
against the array indices, you will know which of the 64 arrays to search (in the worst case, if a
doesn't have any of the 6 highest bits set, that'll be all of them), and each individual array search deals only with 64 bits per element (much easier to pack).
In fact this second approach could be generalized into a tree structure as well, which would give you some sort of trie.
Upvotes: 4