Reputation: 6206
I am working on a fixed-point platform (floating-point arithmetic not supported).
I represent any rational number q
as the floor value of q * (1 << precision)
.
I need an efficient method for calculating log base 2 of x
, where 1 < x < 2
.
Here is what I've done so far:
uint64_t Log2(uint64_t x, uint8_t precision)
{
uint64 res = 0;
uint64 one = (uint64_t)1 << precision;
uint64 two = (uint64_t)2 << precision;
for (uint8_t i = precision; i > 0 ; i--)
{
x = (x * x) / one; // now 1 < x < 4
if (x >= two)
{
x >>= 1; // now 1 < x < 2
res += (uint64_t)1 << (i - 1);
}
}
return res;
}
This works well, however, it takes a toll on the overall performance of my program, which requires executing this for a large amount of input values.
For all it matters, the precision
used is 31, but this may change so I need to keep it as a variable.
Are there any optimizations that I can apply here?
I was thinking of something in the form of "multiply first, sum up last".
But that would imply calculating x ^ (2 ^ precision)
, which would very quickly overflow.
I have previously tried to get rid of the branch, but it just made things worse:
for (uint8_t i = precision; i > 0 ; i--)
{
x = (x * x) / one; // now 1 < x < 4
uint64_t n = x / two;
x >>= n; // now 1 < x < 2
res += n << (i - 1);
}
return res;
Upvotes: 7
Views: 995
Reputation: 1
My proposal would go from opposite direction -- into a use of a constant-performance at fixed number of steps.
Given a reasonable small amount of resources will still suffice and the precision target is known and always reached, the constant-performance deployment can beat most iterative schemes.
A Taylor expansion ( since 1715 ) of log2(x) provides both a solid calculus basement plus (almost) infinite precision a-priori known to be feasible for any depth of fixed-point arithmetics ( be it for Epiphany / FPGA / ASIC / you keep it private / ... )
Math transforms the whole problem into an optionally small amount of a few node points X_tab_i
, for which ( as few as platform precision requires ) constants are pre-calculated for each node point. The rest is a platform-efficient assembly of Taylor sum of products, granting the result is obtained both in constant-time + having a residual error under design-driven threshold ( the target PSPACE x PTIME constraints tradeoff here is obvious for design phase, yet the process is always a CTIME, CSPACE once deployed )
Given X: lookup closest X_tab_i,
with C0_tab_i, C1_tab_i, C2_tab_i, .., Cn_tab_i
//-----------------------------------------------------------------<STATIC/CONST>
// ![i]
#DEFINE C0_tab_i <log2( X_tab_i )>
#DEFINE C1_tab_i < ( X_tab_i )^(-1) * ( +1 / ( 1 * ln(2) )>
#DEFINE C2_tab_i < ( X_tab_i )^(-2) * ( -1 / ( 2 * ln(2) )>
#DEFINE C3_tab_i < ( X_tab_i )^(-3) * ( +1 / ( 3 * ln(2) )>
::: : : :
#DEFINE CN_tab_i < ( X_tab_i )^(-N) * ( -1^(N-1) ) / ( N * ln(2) )>
// -----------------------------------------------------------------<PROCESS>-BEG
DIFF = X - X_tab_i; CORR = DIFF;
RES = C0_tab_i
+ C1_tab_i * CORR; CORR *= DIFF;
RES += C2_tab_i * CORR; CORR *= DIFF;
... +=
RES += Cn_tab_i * CORR; CORR *= DIFF;
// --------------------------------------------------------------<PROCESS>-END:
Upvotes: 1
Reputation: 98358
The only things I can think of is to do the loop with a right-shift instead of a decrement and change a few operations to their equivalent binary ops. That may or may not be relevant to your platform, but in my x64 PC they yield an improvement of about 2%:
uint64_t Log2(uint64_t x, uint8_t precision)
{
uint64_t res = 0;
uint64_t two = (uint64_t)2 << precision;
for (uint64_t b = (uint64_t)1 << (precision - 1); b; b >>= 1)
{
x = (x * x) >> precision; // now 1 < x < 4
if (x & two)
{
x >>= 1; // now 1 < x < 2
res |= b;
}
}
return res;
}
Upvotes: 1