Reputation: 3657
I am trying to compute the IEEE-754 32-bit floating-point square root of various inputs, but for one particular input the algorithm below, based on the Newton-Raphson method, won't converge. What can I do to fix the problem? For the platform I am designing, I have a 32-bit floating-point adder/subtractor, multiplier, and divider.
For the input 0x7F7FFFFF (3.4028234663852886E38), the algorithm won't converge to the correct answer of 18446743523953729536.000000; instead it gives 18446743523953737728.000000.
I am using MATLAB to prototype my code before implementing it in hardware. I can only use single-precision floating-point values (SO NO DOUBLES).
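For reference, the update rule in the loop below (the "S = (S + R/S)/2 method") is the standard Newton-Raphson iteration for S^2 = R:

$$S_{k+1} = \frac{1}{2}\left(S_k + \frac{R}{S_k}\right)$$

In exact arithmetic this converges quadratically to sqrt(R); the failure described above only appears once every operation is rounded to single precision.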
clc; clear; close all;
% Input
R = typecast(uint32(hex2dec(num2str(dec2hex(((hex2dec('7F7FFFFF'))))))),'single')
% Initial estimate
OneOverRoot2 = single(1/sqrt(2));
Root2 = single(sqrt(2));
% Get low and high bits of input R
hexdata_high = bitand(bitshift(hex2dec(num2hex(single(R))),-16),hex2dec('ffff'));
hexdata_low = bitand(hex2dec(num2hex(single(R))),hex2dec('ffff'));
% Change exponent of input to -1 to get Mantissa
temp = bitand(hexdata_high,hex2dec('807F'));
Expo = bitshift(bitand(hexdata_high,hex2dec('7F80')),-7);
hexdata_high = bitor(temp,hex2dec('3F00'));
b = typecast(uint32(hex2dec(num2str(dec2hex(((bitshift(hexdata_high,16)+ hexdata_low)))))),'single');
% If exponent is odd ...
if (bitand(Expo,1))
    % Pretend the mantissa [0.5 ... 1.0) is multiplied by 2 as Expo is odd,
    % so it now has the value [1.0 ... 2.0)
    % Estimate the sqrt(mantissa) as [1.0 ... sqrt(2))
    % IOW: linearly map (0.5 ... 1.0) to (1.0 ... sqrt(2))
    Mantissa = (Root2 - 1.0)/(1.0 - 0.5)*(b - 0.5) + 1.0;
else
    % The mantissa is in range [0.5 ... 1.0)
    % Estimate the sqrt(mantissa) as [1/sqrt(2) ... 1.0)
    % IOW: linearly map (0.5 ... 1.0) to (1/sqrt(2) ... 1.0)
    Mantissa = (1.0 - OneOverRoot2)/(1.0 - 0.5)*(b - 0.5) + OneOverRoot2;
end
newS = Mantissa*2^(bitshift(Expo-127,-1));
S = newS
% S = (S + R/S)/2 method
for j = 1:6
    fprintf('S %u %f %f\n', j, S, (S-sqrt(R)));
    S = single((single(S) + single(single(R)/single(S))))/2;
    S = single(S);
end
goodaccuracy = (abs((single(S)-single(sqrt(single(R)))))) < 2^-23
difference = (abs((single(S)-single(sqrt(single(R))))))
% Get hexadecimal output
hexdata_high = (bitand(bitshift(hex2dec(num2hex(single(S))),-16),hex2dec('ffff')));
hexdata_low = (bitand(hex2dec(num2hex(single(S))),hex2dec('ffff')));
fprintf('FLOAT: T Input: %e\t\tCorrect: %e\t\tMy answer: %e\n', R, sqrt(R), S);
fprintf('output hex = 0x%04X%04X\n',hexdata_high,hexdata_low);
out = hex2dec(num2hex(single(S)));
Upvotes: 2
Views: 910
Reputation: 14225
I took a whack at this. Here's what I came up with:
#include <math.h>  // frexpf, ldexpf

float mysqrtf(float f) {
    if (f < 0) return 0.0f/0.0f;     // negative input: return NaN
    if (f == 1.0f / 0.0f) return f;  // +infinity maps to +infinity
    if (f != f) return f;            // NaN maps to NaN
    // half-ass an initial guess of 1.0.
    int expo;
    float foo = frexpf(f, &expo);    // foo in [0.5, 1), f == foo * 2^expo
    float s = 1.0;
    if (expo & 1) foo *= 2, expo--;  // make the exponent even; foo in [0.5, 2)
    // this is the only case for which what's below fails.
    if (foo == 0x0.ffffffp+0) return ldexpf(0x0.ffffffp+0, expo/2);
    // do four newton iterations.
    for (int i = 0; i < 4; i++) {
        float diff = s*s-foo;
        diff /= s;
        s -= diff/2;
    }
    // do one last newton iteration, computing s*s-foo exactly.
    float scal = s >= 1 ? 4096 : 2048;
    float shi = (s + scal) - scal;   // high 12 bits of significand
    float slo = s - shi;             // rest of significand
    float diff = shi * shi - foo;    // subtraction exact by sterbenz's theorem
    diff += 2 * shi * slo;           // opposite signs; exact by sterbenz's theorem
    diff += slo * slo;
    diff /= s;                       // diff == fma(s, s, -foo) / s.
    s -= diff/2;
    return ldexpf(s, expo/2);
}
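To reproduce the question's case, here's a small harness (a sketch of mine, assuming it's compiled together with mysqrtf above) that pushes the input 0x7F7FFFFF through it:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float mysqrtf(float f);  // the function above

int main(void) {
    uint32_t bits = 0x7F7FFFFFu;      // the question's input (FLT_MAX)
    float f;
    memcpy(&f, &bits, sizeof f);      // reinterpret the bits as a float
    printf("input:   %e\n", f);
    printf("mysqrtf: %f\n", mysqrtf(f));
    printf("sqrt:    %f\n", sqrt((double)f));  // double-precision reference
    return 0;
}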
The first thing to analyse is the formula (s*s-foo)/s in floating-point arithmetic. If s is a sufficiently good approximation to sqrt(foo), Sterbenz's theorem (if two floats x and y satisfy y/2 <= x <= 2y, then x - y is exactly representable, so the subtraction is exact) tells us that the numerator is within an ulp(foo) of the right answer --- all of that error is approximation error from computing s*s. Then we divide by s; this gives us at worst another half-ulp of approximation error. So, even without a fused multiply-add, diff is within 1.5 ulp of what it should be. And we divide it by two.
Notice that the initial guess doesn't in and of itself matter as long as you follow it up with enough Newton iterations.
Measure the error of an approximation s to sqrt(foo) by abs(s - foo/s). The error of my initial guess of 1 is at most 1. A Newton iteration in exact arithmetic squares the error and divides it by 4. A Newton iteration in floating-point arithmetic --- the kind I do four times --- squares the error, divides it by 4, and kicks in another 0.75 ulp of error. You do this four times and you find you have a relative error of at most 0x0.000000C4018384, which is about 0.77 ulp. This means that four Newton iterations yield a faithfully-rounded result.
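A quick sketch of the algebra behind "squares the error and divides it by 4," using the error measure above:

$$s' = \frac{1}{2}\left(s + \frac{foo}{s}\right) \quad\Longrightarrow\quad s' - \sqrt{foo} = \frac{(s - \sqrt{foo})^2}{2s}$$

$$\left|s - \frac{foo}{s}\right| = \frac{|s - \sqrt{foo}|\,(s + \sqrt{foo})}{s} \approx 2\,\left|s - \sqrt{foo}\right|$$

So if the measured error before the step is E, afterwards it is roughly E^2/(4s): squared and divided by 4 for s near 1, plus the rounding error of the step itself.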
I do a fifth Newton step to get a correctly-rounded square root. The reason why it works is a little more intricate.
shi holds the "top half" of s while slo holds the "bottom half." The last 12 bits in each significand will be zero. This means, in particular, that shi * shi and shi * slo and slo * slo are exactly representable as floats.
s*s is within two ulps of foo. shi*shi is within 2047 ulps of s*s. Thus shi * shi - foo is within 2049 ulps of zero; in particular, it's exactly representable and less than 2^-10.
You can check that you can add 2 * shi * slo and get an exactly-representable result that's within 2^-22 of zero, and then add slo*slo and get an exactly-representable result --- s*s - foo computed exactly.
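This is easy to spot-check. Here is a minimal sketch (my sample value is arbitrary; fmaf is the C99 fused multiply-add) that runs the split computation from the code above and compares it against fmaf(s, s, -foo):

#include <math.h>
#include <stdio.h>

int main(void) {
    float foo = 0.7f;               // arbitrary sample in [0.5, 1)
    float s = sqrtf(foo);           // a good approximation to sqrt(foo)
    float scal = s >= 1 ? 4096 : 2048;
    float shi = (s + scal) - scal;  // top 12 bits of the significand
    float slo = s - shi;            // remaining bits
    float diff = shi * shi - foo;
    diff += 2 * shi * slo;
    diff += slo * slo;
    // per the analysis above, both lines should print the same value
    printf("split: %a\n", diff);
    printf("fmaf:  %a\n", fmaf(s, s, -foo));
    return 0;
}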
When you divide by s, you kick in an additional half-ulp of error, which is at most 2^-48 here since our error was already so small.
Now we do a Newton step. We've computed the current error correctly to within 2^-46. Adding half of it to s gives us the square root to within 3*2^-48.
To turn this into a guarantee of correct rounding, we need to prove that there are no floats between 1/2 and 2, other than the one I special-cased, whose square roots are within 3*2^-48 of a midpoint between two consecutive floats. You can do some error analysis, get a Diophantine equation, find all of the solutions of that Diophantine equation, find which inputs they correspond to, and work out what the algorithm does on those. (If you do this, there is one "physical" solution and a bunch of "unphysical" solutions. The one real solution is the only thing I special-cased.) There may be a cleaner way, however.
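If the Diophantine route sounds unappealing, the domain is also small enough to check by brute force. Here's a minimal sketch (assuming it is compiled together with mysqrtf above) that walks every float in [1, 4), which exercises both exponent parities, and compares against sqrt computed in double and rounded back to float --- correctly rounded for float inputs, since double carries more than the 2*24+2 bits needed to avoid double rounding:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float mysqrtf(float f);  // the function above

int main(void) {
    uint32_t u, start, end;
    float one = 1.0f, four = 4.0f;
    memcpy(&start, &one, sizeof start);  // bit pattern of 1.0f
    memcpy(&end, &four, sizeof end);     // bit pattern of 4.0f
    long bad = 0;
    for (u = start; u < end; u++) {      // consecutive floats in [1, 4)
        float f, want;
        memcpy(&f, &u, sizeof f);
        want = (float)sqrt((double)f);   // correctly-rounded reference
        if (mysqrtf(f) != want && bad++ < 10)
            printf("mismatch at %a\n", f);
    }
    printf("%ld mismatches\n", bad);
    return 0;
}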
Upvotes: 1