Reputation: 46943
A friend asked me to share something I stumbled upon a while ago. The original post is taken from here. The problem statement can be found here; it is basically a site for algorithmic competitions.
I was faced with an algorithmic problem that I solved using the following code:
#include <vector>
#include <cstring>
using namespace std;

double dp[80002][50];

class FoxListeningToMusic {
public:
    vector<double> getProbabilities(vector<int> length, int T) {
        memset(dp, 0, sizeof(dp));
        int n = length.size();

        // time 0: each of the n songs with probability 1/n
        for(int i = 0; i < n; i++)
            dp[0][i] = 1.0 / (double)n;

        double mul = 1.0 / (double)n;
        int idx;
        for(int i = 1; i <= T; i++) {
            for(int j = 0; j < n; j++) {
                idx = i - length[j];
                if(idx >= 0) {
                    // an earlier state exists at time idx: propagate it with weight 1/n
                    for(int k = 0; k < n; k++)
                        dp[i][k] += mul * dp[idx][k];
                } else {
                    // no earlier state to propagate from: add the base probability 1/n for song j
                    dp[i][j] += mul;
                }
            }
        }

        vector<double> v(n);
        for(int i = 0; i < n; i++)
            v[i] = dp[T][i];
        return v;
    }
};
It is not important whether the code solves the problem correctly, at least for what I am going to discuss. The fact is that I got time limit exceeded with this code (meaning it ran for more than 2 seconds on some test cases). That was somewhat expected, as the complexity here is O(T * length.size()^2), which becomes 80000 * 50 * 50 = 2 * 10^8 if we take the problem constraints into account. However, the interesting thing is that I had specifically tested my solution against the time limit. The case I used seemed to be the "worst case" for my solution: 50 ones given in length and T = 80000. The code ran for 0.75 seconds, which is well below the time limit of 2 seconds.
I say this case is the worst case, because the number of instructions that will be executed depends only on the branching condition idx >= 0 in the inner for. If it is true, one more loop of complexity O(n) is executed; otherwise only a single O(1) operation is executed. And as you can see, the smaller the elements of length, the more often this condition is true.
Despite this reasoning, my solution fails on the following case:
length = {1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 3, 2, 3, 2, 3, 3, 1, 2, 3, 1, 2, 3, 2,
1, 3, 1, 1, 1, 2, 3, 2, 3, 2, 2, 1, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1,
1, 3, 2, 76393} T= 77297.
For this case my program runs for 5.204000 seconds.
My first assumption was that the reason for this unexpected ratio of runtime measurements (given that we would expect far fewer processor instructions to be executed in the first case) was that the processor somehow caches similar calculations: in my example the calculations are symmetric with respect to all the elements of length, and a really clever processor could use this to avoid repeating the same sequence of instructions. So I tried composing another example, this time with different values in the length array:
length = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 77943}
With T = 80000 it runs for 0.813000 seconds.
After this example I can no longer explain these time measurements: my second example seems to require more processor instructions than the test that failed me, and it does not allow the caching I thought was happening in the first example. I have not actually been able to pin down the cause of this behaviour, but I am quite sure it has something to do either with processor caches or pipelines. I am very curious how these experiments behave on different chipsets, so feel free to comment here.
Also, if anyone is more knowledgeable about hardware than me and can explain this behaviour, it would be appreciated.
Until then there is a note I should make for myself: when estimating algorithm complexity, do not underestimate processor optimizations. Sometimes they significantly decrease or increase the amortized speed of specific examples.
Upvotes: 3
Views: 292
Reputation: 15163
Another option to deal with this kind of situation is to use fixed-point arithmetic and avoid floating point altogether. The problem statement requires that the answer be accurate to 1e-9, and since 2^64 is about 10^19 and you are doing at most 80000 iterations, this gives plenty of precision. The way this would work is to define a large constant, say
const uint64_t ONE = pow(10,17);
You initialize your array of uint64_t to ONE/n rather than 1.0/double(n), and the main loop would look like this:
// dpi is the uint64_t array mentioned above (e.g. uint64_t dpi[80002][50];)
for(int i = 1; i <= T; i++) {
    for(int j = 0; j < n; j++) {
        idx = i - length[j];
        if(idx >= 0) {
            // the hot inner loop now consists of integer additions only
            for(int k = 0; k < n; k++) {
                dpi[i][k] += dpi[idx][k];
            }
        }
        else
            dpi[i][j] += ONE;
    }
    // divide by n once per time step, instead of multiplying by 1/n inside the inner loop
    for(int k = 0; k < n; k++) {
        dpi[i][k] = dpi[i][k] / n;
    }
}
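For completeness, the initialization described above might look like this (a sketch; it assumes dpi is declared with the same dimensions as dp):
memset(dpi, 0, sizeof(dpi));
for(int i = 0; i < n; i++)
    dpi[0][i] = ONE / n;  // fixed-point analogue of dp[0][i] = 1.0 / n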
In theory, this should be faster, since you are avoiding floating point operations in the main loop, and the inner loop consists only of integer additions. On my machine the performance improvement is only about 10%, which indicates that the real bottleneck might be memory access. But, in other situations, you might see a bigger performance improvement.
Upvotes: 1
Reputation: 46943
The reason for this weird behaviour turned out to be denormal numbers. Adding code to treat such numbers as pure zeroes sped up my code immensely on such corner cases.
HINT: Denormal numbers in this case are numbers that are very close to zero (e.g. around 10^-38 for floats; correction due to @PascalCuoq). Processing such numbers is much slower for the processor, because of this (taken from Wikipedia):
Some systems handle denormal values in hardware, in the same way as normal values. Others leave the handling of denormal values to system software, only handling normal values and zero in hardware. Handling denormal values in software always leads to a significant decrease in performance.
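For reference, one common way to make the hardware treat denormals as zero on x86 is to set the SSE FTZ (flush-to-zero) and DAZ (denormals-are-zero) bits. The sketch below illustrates that technique; it assumes the compiler uses SSE for the floating point math (the default on x86-64) and is not necessarily the exact code I used:
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void treatDenormalsAsZero() {
    // FTZ: denormal results of SSE operations are flushed to zero
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // DAZ: denormal inputs to SSE operations are treated as zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
Calling something like this once at the start of getProbabilities keeps the tiny dp values from ever taking the slow denormal path.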
EDIT: I also found this suggestion on SO on how to check whether a number has become denormal.
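As a simple illustration (not necessarily what the linked suggestion does), standard C++ can classify a value with std::fpclassify:
#include <cmath>   // std::fpclassify, FP_SUBNORMAL
#include <cstdio>

bool isDenormal(double x) {
    return std::fpclassify(x) == FP_SUBNORMAL;  // true for denormal (subnormal) values
}

int main() {
    double tiny = 1e-320;                    // far below DBL_MIN (~2.2e-308), so denormal
    std::printf("%d\n", isDenormal(tiny));   // prints 1
}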
Upvotes: 7