Reputation: 79

Algorithm for square root calculation

I have been implementing control software in C and one of the control algorithms requires square root calculation. I have been looking for suitable square root calculation algorithm which will have constant execution time irrespective to the radicand value. This requirement rules out the sqrt function from the standard library.

As far as my platform I have been working with floating point 32 bits ARM Cortex A9 based machine. As far as the radicand range in my application the algorithms are calculated in physical units so I expect following range <0, 400>. As far as the required error I think that error about 1 % could be sufficient. Can anybody recommend me a square root calculation algorithm suitable for my purposes?

Upvotes: 1

Answers (3)

Professor Jimatura

Reputation: 79

I have decided to use following approach. I have chosen the Newton method and then I have experimentally set the fixed number of iterations so that the error in whole range of the radicand i.e. <0,400> doesn't exceed the prescribed value. I have ended at six iterations. As far as the radicand with value 0 I have decided to return 0 without any calculations.

Upvotes: 0

Aki Suihkonen

Reputation: 20057

Arm v7 instruction set provides a fast instruction for inverse reciprocal square root calculation vrsqrte_f32 for two simultaneous approximations and vrsqrteq_f32 for four approximations. (The scalar variant vrsqrtes_f32 is only available on Arm64 v8.2).

Then the result can be simply calculated by x * vrsqrte_f32(x);, which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf

ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.

At x==0 vrsqrtes_f32(x) == Inf, so x*vrsqrtes_f32(x) would be NaN.

If the value of x==0 is unavoidable, the optimal two instruction sequence needs a bit more adjustment:

float sqrtest(float a) {
    // need to "transfer" or "convert" the scalar input 
    // to a vector of two
    // - optimally we would not need an instruction for that
    // but we would just let the processor calculate the instruction
    // for all the lanes in the register
    float32x2_t a2 = vdup_n_f32(a);

    // next we create a mask that is all ones for the legal
    // domain of 1/sqrt(x)
    auto is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));

    // calculate two reciprocal estimates in parallel 
    float32x2_t a2est = vrsqrte_f32(a2);

    // we need to mask the result, so that effectively
    // all non-legal values of a2est are zeroed
    a2est = vand_u32(is_legal, a2est);

    // x * 1/sqrt(x) == sqrt(x)
    a2 = vmul_f32(a2, a2est);

    // finally we get only the zero lane of the result
    // discarding the other half
    return vget_lane_f32(a2, 0);
}

Surely this method will have almost twice the throughput with

void sqrtest2(float &a, float &b) {
    float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
    float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vand_u32(is_legal, a2est);
    a2 = vmul_f32(a2, a2est);
    a = vget_lane_f32(a2,0); 
    b = vget_lane_f32(a2,1); 
}

And even better, if you can work directly with float32x2_t or float32x4_t inputs and outputs.

float32x2_t sqrtest2(float32x2_t a2) {
    float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vand_u32(is_legal, a2est);
    return vmul_f32(a2, a2est);
}

This implementation gives sqrtest2(1) == 0.998 and sqrtest2(400) == 19.97 (tested on MacBook M1 with arm64). Being branchless and LUT free, this has likely a constant execution time, assuming that all the instructions execute in constant number of cycles.

Upvotes: 1

4386427

Reputation: 44340

My initial approach would be to use the Taylor serie for square root with precalculated coefficients at a number of fixed points. This will reduce the calculation to a subtraction and a number of multiplication.

The look-up table would be a 2D array like:

point | C0  | C1  | C2  | C3  | C4  | ...
-----------------------------------------
 0.5  | f00 | f01 | f02 | f03 | f04 |
-----------------------------------------
 1.0  | f10 | f11 | f12 | f13 | f14 |
-----------------------------------------
 1.5  | f20 | f21 | f22 | f23 | f24 |
-----------------------------------------
....

So when calculating sqrt(x) use the table row with the point closest to x.

Example:

sqrt(1.1) (i.e. use point 1.0 coeffients)

f10 + 
f11 * (1.1 - 1.0) + 
f12 * (1.1 - 1.0) ^ 2 + 
f13 * (1.1 - 1.0) ^ 3 + 
f14 * (1.1 - 1.0) ^ 4

The table above suggest a fixed distance between the points at which you precalculate coeffients (i.e. 0.5 between each point). However, due to the natur of square root you may find that the distance between points shall differ for different ranges of x. For instance x in [0 - 1] -> distance 0.1,x in [1 - 2] -> distance 0.25, x in [2 - 10] -> distance 0.5 and so on.

Another thing is the number of terms needed to get the desired precision. Here you may also find that different ranges of x may require a different number of coefficients.

All this is easy to precalculation on a normal computer (e.g. using excel).

Note: For values very close to zero this method isn't good. Maybe Newtons method will be a better choice.

Taylor series: https://en.wikipedia.org/wiki/Taylor_series

Newtons method: https://en.wikipedia.org/wiki/Newton%27s_method

Also relevant: https://math.stackexchange.com/questions/291168/algorithms-for-approximating-sqrt2

Upvotes: 1

Algorithm for square root calculation

Answers (3)

Related Questions