Reputation: 12865
Is there some way to speed up computing the reciprocal (1 over X) if precision is not crucial?
So, I need to calculate 1/X. Is there some workaround so that I lose precision but do it faster?
Upvotes: 30
Views: 27200
Reputation: 1
Taking the magic number approach of treating the float's bit pattern as its own logarithm (up to scaling and a constant offset), you can get very close with simple methods.
First we redo the derivation that the fast inverse square root uses, but for y = 1/x. Reading a float's bits I as an integer gives an approximate scaled-and-offset log:
log2(x) ≈ I_x/2^23 - (127 - mu)
where mu is a small tuning constant. We want
log2(y) = log2(1/x) = -log2(x)
Substituting the approximation on both sides:
I_y/2^23 - (127 - mu) = -(I_x/2^23 - (127 - mu))
I_y = (2 * (127 - mu) * 2^23) - I_x
For a rough-and-ready approximation, take mu = 0; the magic value is then 2 * 127 * 2^23 = 0x7F000000, i.e.
I = 0x7F000000 - I
However, the error is off-center. (Plot: error when mu = 0.)
We need to center the error around the target, so choose mu to minimize it. For the reciprocal, I found mu = 0.05 to be about optimal, so the magic number would be 0x7EEEEEEE.
The error is now centered. (Plot: error when mu = 0.05.)
There is an even better answer available, though. I noticed that the span of the error shrinks at certain mu values: the size of the wobble is greatly reduced, but the offset moves far away from center. You can then recenter the result with a single multiply.
Let's take mu = 0.8, giving the magic number 0x7E333333.
Look at how the error flattens out, but is now only offset from the target. (Plot: error when mu = 0.8.)
Now we just scale by 2.81 to recenter it.
The error is flat and centered. (Plot: error when mu = 0.8, scaled by 2.81.)
Final code:
#include <stdint.h>   /* int32_t: plain 'long' is 64 bits on many platforms */

float Q_inv(float x) {
    int32_t i;
    float y = x;
    i = *(int32_t *)&y;    /* read the float's bits as an integer */
    i = 0x7E333333 - i;    /* magic subtraction with mu = 0.8 */
    y = *(float *)&i;      /* reinterpret the adjusted bits as a float */
    return y * 2.81f;      /* rescale to recenter the error */
}
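As a quick sanity check (my addition, not part of the original answer), a small harness can sweep one binade and report the worst relative error of the Q_inv above:
#include <stdio.h>
#include <math.h>

int main(void) {
    float worst = 0.0f;
    for (float x = 0.5f; x < 2.0f; x += 1e-4f) {
        float err = fabsf(Q_inv(x) * x - 1.0f);   /* |x * (estimate of 1/x) - 1| */
        if (err > worst) worst = err;
    }
    printf("worst relative error: %g\n", worst);
    return 0;
}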
Upvotes: 0
Reputation: 61239
The rcpss assembly instruction computes an approximate reciprocal with |relative error| ≤ 1.5 * 2^-12.
You can have the compiler emit it with the -mrecip flag (you might also need -ffast-math).
The intrinsic is _mm_rcp_ss.
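For reference, a minimal sketch of using the intrinsic (my example, not the answer's), with an optional Newton-Raphson step to sharpen the roughly 12-bit estimate:
#include <xmmintrin.h>   /* SSE */

float fast_recip(float x) {
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));   /* rcpss estimate */
    r = r * (2.0f - x * r);   /* one Newton step roughly doubles the correct bits */
    return r;
}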
Upvotes: 0
Reputation: 33
I've bench-tested these methods on an Arduino NANO for speed and 'accuracy'.
The basic calculation was to set up variables Y = 15,000,000 and Z = 65,535
(in my real case, Y is a constant and Z can vary between 65,535 and 3,000, so a useful test).
Calculation time on the Arduino was established by putting a pin low, then high as the calculation was made, then low again, and comparing the on-times with a logic analyser, over 100 cycles. A sketch of the harness follows.
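The answer doesn't show the harness itself; a minimal sketch of that pin-toggle timing might look like this (the pin number and the operation under test are assumptions):
const int PROBE_PIN = 2;   /* assumed pin watched by the logic analyser */

void setup() {
    pinMode(PROBE_PIN, OUTPUT);
}

void loop() {
    volatile float Y = 15000000.0f, Z = 65535.0f, R;   /* volatile: keep the math from being optimized away */
    digitalWrite(PROBE_PIN, LOW);
    digitalWrite(PROBE_PIN, HIGH);    /* rising edge marks the start */
    for (int i = 0; i < 100; i++) {
        R = Y / Z;                    /* operation under test, 100 cycles */
    }
    digitalWrite(PROBE_PIN, LOW);     /* falling edge marks the end */
    delay(10);
}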
With variables as unsigned integers:
Y * Z takes 0.231 msec
Y / Z takes 3.867 msec
With variables as floats:
Y * Z takes 1.066 msec
Y / Z takes 4.113 msec
Basic benchmark: 15,000,000 / 65,535 = 228.885 via calculator.
Using {Jack G's} float reciprocal algorithm:
Y * reciprocal(Z) takes 1.937 msec, which is a good improvement, but accuracy less so: 213.68.
Using {nimig18's} float inv_fast:
Y * inv_fast(Z) takes 5.501 msec, accuracy 228.116, with a single iteration
Y * inv_fast(Z) takes 7.895 msec, accuracy 228.883, with a second iteration
Using Wikipedia's Q_rsqrt (pointed to by {Jack G}):
Y * Q_rsqrt(Z) takes 6.104 msec, accuracy 228.116, with a single iteration
All entertaining but ultimately disappointing!
Upvotes: 3
Reputation: 5322
I believe that what he was looking for is a more efficient way to approximate 1.0/x, rather than some technical definition of approximation under which 1 would count as a very imprecise answer. I also believe that the following satisfies that.
#ifdef __cplusplus
#include <cstdint>
#else
#include <stdint.h>
#endif
__inline__ double __attribute__((const)) reciprocal( double x ) {
union {
double dbl;
#ifdef __cplusplus
std::uint_least64_t ull;
#else
uint_least64_t ull;
#endif
} u;
u.dbl = x;
u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
// pow( x, -0.5 )
u.dbl *= u.dbl; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
return u.dbl;
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
union {
float single;
#ifdef __cplusplus
std::uint_least32_t uint;
#else
uint_least32_t uint;
#endif
} u;
u.single = x;
u.uint = ( 0xbe6eb3beU - u.uint ) >> 1;
// pow( x, -0.5 )
u.single *= u.single; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
return u.single;
}
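Not part of the original answer, but worth noting: if more accuracy is needed, a single Newton-Raphson step on top of the estimate roughly doubles the number of correct bits. A sketch using the double version above:
double reciprocal_refined(double x) {
    double y = reciprocal(x);    /* rough estimate from the magic-number method */
    return y * (2.0 - x * y);    /* Newton step for f(y) = 1/y - x */
}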
Hmm... I wonder if the CPU manufacturers knew you could approximate the reciprocal with only a single multiply, subtraction, and bit-shift when they designed the CPU... hmm...
As for benchmarking, the hardware multiply (squaring) instruction combined with the hardware subtraction instruction ends up just as fast as the hardware 1.0/x instruction on modern-day computers (my benchmarks were on an Intel i7, but I would assume similar results for other processors). However, if this algorithm were implemented in the hardware as a new assembly instruction, then the increase in speed would probably be good enough for it to be quite practical.
For more information about this method, this implementation is based off the wonderful "fast" inverse square root algorithm.
As Pharap brought to my attention, reading an inactive member of a union is undefined behavior, so there are two possible solutions that I have devised from his helpful comments to avoid it. The first seems more like an unsavory trick to get around a language semantic, and is practically no better than the original solution.
#ifdef __cplusplus
#include <cstdint>
#else
#include <stdint.h>
#endif
__inline__ double __attribute__((const)) reciprocal( double x ) {
union {
double dbl[2];
#ifdef __cplusplus
std::uint_least64_t ull[2];
#else
uint_least64_t ull[2];
#endif
} u;
u.dbl[0] = x; // dbl is now the active property, so only dbl can be read now
u.ull[1] = 0;//trick to set ull to the active property so that ull can be read
u.ull[0] = ( 0xbfcdd6a18f6a6f52ULL - u.ull[0] ) >> 1;
u.dbl[1] = 0; // now set dbl to the active property so that it can be read
u.dbl[0] *= u.dbl[0];
return u.dbl[0];
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
union {
float flt[2];
#ifdef __cplusplus
std::uint_least32_t uint[2];
#else
uint_least32_t uint[2];
#endif
} u;
u.flt[0] = x; // now flt is active
u.uint[1] = 0; // set uint to be active for reading and writing
u.uint[0] = ( 0xbe6eb3beU - u.uint[0] ) >> 1;
u.flt[1] = 0; // set flt to be active for reading and writing
u.flt[0] *= u.flt[0];
return u.flt[0];
}
The second possible solution is much more palatable because it gets rid of unions altogether. However, this solution will be much slower if it is not properly optimized by the compiler. On the upside, the solution below is completely agnostic of byte order:
#ifdef __cplusplus
#include <cstdint>
#include <cstring>
#define stdIntWithEightBits std::uint8_t
#define stdIntSizeOfFloat std::uint32_t
#define stdIntSizeOfDouble std::uint64_t
#else
#include <stdint.h>
#include <string.h>
#define stdIntWithEightBits uint8_t
#define stdIntSizeOfFloat uint32_t
#define stdIntSizeOfDouble uint64_t
#endif
__inline__ double __attribute__((const)) reciprocal( double x ) {
double byteIndexFloat = 1.1212798184631136e-308;//00 08 10 18 20 28 30 38 bits
stdIntWithEightBits* byteIndexs = reinterpret_cast<stdIntWithEightBits*>(&byteIndexFloat);
stdIntWithEightBits* inputBytes = reinterpret_cast<stdIntWithEightBits*>(&x);
/* cast each byte up before shifting: shifting a promoted int by 32 or more bits is undefined */
stdIntSizeOfDouble inputAsUll = (
((stdIntSizeOfDouble)inputBytes[0] << byteIndexs[0]) |
((stdIntSizeOfDouble)inputBytes[1] << byteIndexs[1]) |
((stdIntSizeOfDouble)inputBytes[2] << byteIndexs[2]) |
((stdIntSizeOfDouble)inputBytes[3] << byteIndexs[3]) |
((stdIntSizeOfDouble)inputBytes[4] << byteIndexs[4]) |
((stdIntSizeOfDouble)inputBytes[5] << byteIndexs[5]) |
((stdIntSizeOfDouble)inputBytes[6] << byteIndexs[6]) |
((stdIntSizeOfDouble)inputBytes[7] << byteIndexs[7])
);
inputAsUll = ( 0xbfcdd6a18f6a6f52ULL - inputAsUll ) >> 1;
double outputDouble;
const stdIntWithEightBits outputBytes[] = {
inputAsUll >> byteIndexs[0],
inputAsUll >> byteIndexs[1],
inputAsUll >> byteIndexs[2],
inputAsUll >> byteIndexs[3],
inputAsUll >> byteIndexs[4],
inputAsUll >> byteIndexs[5],
inputAsUll >> byteIndexs[6],
inputAsUll >> byteIndexs[7]
};
memcpy(&outputDouble, &outputBytes, 8);
return outputDouble * outputDouble;
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
float byteIndexFloat = 7.40457e-40; // 0x00 08 10 18 bits
stdIntWithEightBits* byteIndexs = reinterpret_cast<stdIntWithEightBits*>(&byteIndexFloat);
stdIntWithEightBits* inputBytes = reinterpret_cast<stdIntWithEightBits*>(&x);
/* cast before shifting so the high byte cannot overflow a signed int */
stdIntSizeOfFloat inputAsInt = (
((stdIntSizeOfFloat)inputBytes[0] << byteIndexs[0]) |
((stdIntSizeOfFloat)inputBytes[1] << byteIndexs[1]) |
((stdIntSizeOfFloat)inputBytes[2] << byteIndexs[2]) |
((stdIntSizeOfFloat)inputBytes[3] << byteIndexs[3])
);
inputAsInt = ( 0xbe6eb3beU - inputAsInt ) >> 1;
float outputFloat;
const stdIntWithEightBits outputBytes[] = {
inputAsInt >> byteIndexs[0],
inputAsInt >> byteIndexs[1],
inputAsInt >> byteIndexs[2],
inputAsInt >> byteIndexs[3]
};
memcpy(&outputFloat, &outputBytes, 4);
return outputFloat * outputFloat;
}
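As an aside (my addition, not the answer's), the usual defined-behavior way to type-pun is a single memcpy of the whole word, which compilers optimize to a plain register move; it assumes only that float and integer byte order agree, which holds on mainstream platforms (using the <cstdint>/<cstring> includes already above):
double reciprocal_via_memcpy(double x) {
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);           /* defined-behavior bit copy */
    i = (0xbfcdd6a18f6a6f52ULL - i) >> 1;    /* same magic constant as above: ~pow(x, -0.5) */
    std::memcpy(&x, &i, sizeof x);
    return x * x;                            /* square it: pow(x, -0.5)^2 = 1/x */
}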
Disclaimer: lastly, please note that I am more of a novice in C++. As such, I welcome with wide-open arms any best-practices, formatting, or clarity edits, both to improve the quality of this answer for all who read it and to expand my knowledge of C++ for all my years to come.
Upvotes: 15
Reputation: 876
This should do it, with a number of pre-unrolled Newton iterations evaluated as a Horner polynomial, which uses the fused multiply-accumulate operations that most modern CPUs execute in a single clock cycle:
#include <stdint.h>   /* uint32_t, int32_t */

float inv_fast(float x) {
    union { float f; int32_t i; } v;
    float w, sx;
    sx = (x < 0) ? -1.0f : 1.0f;
    x = sx * x;   /* work on |x|; the sign is restored at the end */
    v.i = (int32_t)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;
    /* Efficient iterative approximation improvement, in Horner polynomial form. */
    v.f = v.f * (2 - w);   /* single iteration,  Err = -3.36e-3 * 2^(-floor(log2(x))) */
    // v.f = v.f * (4 + w * (-6 + w * (4 - w)));   /* second iteration, Err = -1.13e-5 * 2^(-floor(log2(x))) */
    // v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w * (56 + w * (-28 + w * (8 - w))))))); /* third iteration, Err = +-6.8e-8 * 2^(-floor(log2(x))) */
    return v.f * sx;
}
Fine print: the approximation does not do so well closer to 0, so either you, the programmer, need to test the performance, or restrict the input from getting too low and fall back to hardware division. In other words: be responsible! A guarded wrapper is sketched below.
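A guarded wrapper might look like this (my sketch, not the answer's; the threshold is an arbitrary assumption):
#include <math.h>   /* fabsf */

float inv_fast_guarded(float x) {
    const float TINY = 1e-30f;
    return (fabsf(x) < TINY) ? 1.0f / x : inv_fast(x);   /* exact division near 0 */
}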
Upvotes: 2
Reputation: 18203
The fastest way that I know of is to use SIMD operations. http://msdn.microsoft.com/en-us/library/796k1tty(v=vs.90).aspx
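As a sketch of what that can look like (my example; the linked page covers the intrinsics in general), SSE's rcpps approximates four reciprocals at once, optionally refined with a Newton step:
#include <xmmintrin.h>

void recip4(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);   /* load four floats */
    __m128 r = _mm_rcp_ps(v);      /* ~12-bit reciprocal estimates */
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(v, r)));   /* Newton step */
    _mm_storeu_ps(out, r);
}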
Upvotes: 0
Reputation: 13908
First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation if possible (to pull it out of a loop, for example). To see this optimization, you need to build and run in Release mode.
Division may be heavier than multiplication (though a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case this doesn't apply), so if you do have 1/X appearing inside a loop (and more than once), you can assist by computing the reciprocal once before the loop (float Y = 1.0f/X;) and then using Y inside it, as in the sketch below. (The compiler optimization might do this in any case.)
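A sketch with illustrative names (not from the answer):
/* One division before the loop, multiplications inside it. */
void scale_all(float *a, int n, float X) {
    float Y = 1.0f / X;              /* cache the reciprocal once */
    for (int i = 0; i < n; i++)
        a[i] *= Y;                   /* multiply instead of divide */
}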
Also, certain formulas can be redesigned to remove division or other inefficient computations; for that, you could post the larger computation being performed. Even there, the program or algorithm itself can sometimes be restructured so that time-consuming loops are hit less frequently.
How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.
However, in general, there's no way to speed up division. If there were, compilers would already be doing it.
Upvotes: 3
Reputation: 29519
First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?
As Mystical says, 1/x can be calculated very quickly. Make sure you're not using the double datatype for either the 1 or the divisor; floats are much faster.
That said, benchmark, benchmark, benchmark. Don't waste hours on numerical theory just to discover that the source of the poor performance is I/O access.
Upvotes: 6