Reputation: 3024
There are 2 very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements on the second series. This can be traversed and processed with 2 nested loops. But the unpredictability of the amount of matching elements for each member of the first array makes things very, very slow.
The actual processing of the 2nd series of elements involves logical and (&) and a population count.
I couldn't find good optimizations using C but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then doing the batch processing of the matching bytes of the second series, perhaps in buffers of 1MB or something. But the code would be get quite messy.
Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!
Sample/demo code with simplified problem, first series are "people" and second series are "events", for clarity's sake. (the original problem is actually 100m and 10,000m entries!)
#include <stdio.h>
#include <stdint.h>
#define PEOPLE 1000000 // 1m
struct Person {
uint8_t age; // Filtering condition
uint8_t cnt; // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags
#define EVENTS 100000000 // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags
void init_arrays() {
for (int i = 0; i < PEOPLE; i++) { // just some stuff
P[i].age = i & 0x07;
P[i].cnt = i % 220; // assert( sum < EVENTS );
}
for (int i = 0; i < EVENTS; i++) {
P1[i] = i % 7; // just some stuff
P2[i] = i % 9; // just some other stuff
}
}
int main(int argc, char *argv[])
{
uint64_t sum = 0, fcur = 0;
int age_filter = 7; // just some
init_arrays(); // Init P, P1, P2
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
sum += __builtin_popcount( P1[fcur] & P2[fcur] );
else
fcur += P[p].cnt; // skip this person's events
printf("(dummy %ld %ld)\n", sum, fcur );
return 0;
}
gcc -O5 -march=native -std=c99 test.c -o test
Upvotes: 5
Views: 353
Reputation: 3024
To tackle the branch mispredictions (missing in other answers) the code could do something like:
#ifdef MISPREDICTIONS
if (cond)
sum += value
#else
mask = - (cond == 0); // cond: 0 then -0, binary 00..; cond: 1 then -1, binary 11..
sum += (value & mask); // if mask is 0 sum value, else sums 0
#endif
It's not completely free since there are data dependencies (think superscalar cpu). But it usually gets a 10x boost for mostly unpredictable conditions.
Upvotes: 0
Reputation: 727077
Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:
uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
if (P[p].age < age_filter) {
int64_t e = P[p].cnt;
for ( ; e >= 8 ; e -= 8) {
sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
p1 += 8;
p2 += 8;
}
for ( ; e ; e--) {
sum += __builtin_popcount( *p1++ & *p2++ );
}
} else {
p1 += P[p].cnt;
p2 += P[p].cnt;
}
}
In my testing this speeds up your code from 1.515s to 0.855s.
Upvotes: 4
Reputation: 20087
A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. First, if the event tables are not very random or if they do not consists of pathological functions, such as truth tables of bignum multiplication, then first one may achieve compression of the functions and secondly arithmetic operations for truth tables can be calculated in compressed form. Each subtree can be shared between users and each arithmetic operation for two identical subtrees has to be calculated only once.
Upvotes: 2
Reputation: 18368
If your actual data(age,count etc.) is indeed 8-bit there is probably a lot of redundancy in calculations. In this case you can replace the processing by lookup tables - for each 8-bit value you'll have 256 possible outputs and instead of computation it might be possible to read the computed data from the table.
Upvotes: 0
Reputation: 20087
The answer by Neil doesn't require sorting by age, which btw could be a good idea --
If the second loop has holes (please correct original source code to support that idea), a common solution is to do cumsum[n+1]=cumsum[n]+__popcount(P[n]&P2[n]);
Then for each people
sum+=cumsum[fcur + P[p].cnt] - cumsum[fcur];
Anyway it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can anyway take place by calling the inner loop for all the consecutive people meeting the condition.
If there are really max 8 predicates, it could makes sense to precalculate all the
sums (_popcounts(predicate[0..255]))
for each people into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16 bit entries).
Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may not anymore make sense to calculate cumulative sums. Maybe, maybe not. More likely just some SSE parallelism 8 or 16 people at a time will do.
Upvotes: 2
Reputation: 286
I don't know about gcc -O5 (it seems not documented here) and seems to produce the exact same code as gcc -O3 here with my gcc 4.5.4 (though, only tested on a relatively small code sample)
depending on what you want to achieve, -O3 can be slower than -O2
as with your problem, I'd suggest thinking more about your data structure than the actual algorithm. you should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't repsented in a convenient manner.
if you want to quickly cut a large set of your data based on a single criteria (here, age in your example) I'd recommand using a variant of a sorted tree.
Upvotes: 0
Reputation: 55432
I don't know if your sample code accurately reflects your problem but it can be rewritten like this:
for (int64_t p = 0; p < PEOPLE ; p++)
if (P[p].age < age_filter)
fcur += P[p].cnt;
for (int64_t e = 0; e < fcur ; e++)
sum += __builtin_popcount( P1[e] & P2[e] );
Upvotes: 1