Reputation: 11489

Integer cube root

I'm looking for fast code for 64-bit (unsigned) cube roots. (I'm using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnostic.) I will denote by ulong a 64-bit unisgned integer.

Given an input n I require the (integral) return value r to be such that

r * r * r <= n && n < (r + 1) * (r + 1) * (r + 1)

That is, I want the cube root of n, rounded down. Basic code like

return (ulong)pow(n, 1.0/3);

is incorrect because of rounding toward the end of the range. Unsophisticated code like

ulong
cuberoot(ulong n)
{
    ulong ret = pow(n + 0.5, 1.0/3);
    if (n < 100000000000001ULL)
        return ret;
    if (n >= 18446724184312856125ULL)
        return 2642245ULL;
    if (ret * ret * ret > n) {
        ret--;
        while (ret * ret * ret > n)
            ret--;
        return ret;
    }
    while ((ret + 1) * (ret + 1) * (ret + 1) <= n)
        ret++;
    return ret;
}

gives the correct result, but is slower than it needs to be.

This code is for a math library and it will be called many times from various functions. Speed is important, but you can't count on a warm cache (so suggestions like a 2,642,245-entry binary search are right out).

For comparison, here is code that correctly calculates the integer square root.

ulong squareroot(ulong a) {
    ulong x = (ulong)sqrt((double)a);
    if (x > 0xFFFFFFFF || x*x > a)
        x--;
    return x;
}

Upvotes: 22

Answers (9)

RARE Kpop Manifesto

Reputation: 2865

A VERY VERY lazy way to obtain an approximate starting point for # of bits that's a multiple of 3 ::

— The - 1 at the end is essential since the right-most digit (to the left of radix) is always the 0-th power of any base when expressed in big-endian notation

3 * ( str_len(sprintf("%o", x)) - 1 )

 e.g.        x :=        65 812 642 870 067 
       
         octal ==> 8 # 1 675 546 314 631 463
 string length ==> 16
  approx. bits ==> 45

45 is indeed correct, since this number is just slightly shy of 2 ^ 46

Yes it's a very lazy way, but might not be all that fast, since it still has to process the full octal for you, even at reduced precision.

Although my own adaptation of the Hacker's Delight algo above to awk doesn't directly use this sprintf() approach, I do a rapid scan of

   x  <  8 ^ 8  <=  x            ( 16,777,216 | 1 << 24 )

below it, just one more check for 

 x  <  8 ^ 4  <=  x               ( 4,096 | 1 << 12 )

at or above 8 ^ 8 check for

 x < 8 ^ 12 <=> x < 8 ^ 16 <= x ( 1 << 36 | 1 << 48 )

 within these 2, figure out which one is suitable for x,

 otherwise, just set it to the max of 60-bits, or 8^20

8's are very convenient in this regard since, by definition, any and every power of 8 is a power of 2 that's divisible by 3.

####################################

UPDATE 1 :: fully POSIX-compliant awk adaptation of algorithm above,

*1. looping by powers of 8 instead of 2 simplifies the tracking of exponent ::: s -= 3 => --s

function returns a negative integer representing the real/non-complex cube root for negative inputs
the accepted range is just shy of +/- 159-bits
due to lack of native bit-shifting in awk, no division/modulo ops are used by the function
all coefficients, offsets, bases, and exponents required by the function are generated on the fly, eliminating the need to hard-code in "magic numbers"*

function further updated to automatically adjust for larger inputs up to +/- 159-bits, specifically,

( 2^53 - 1 ) ^ 3

:=  730750818665451215712927172538123444058715062271

             170141183460469231731687303715884105727 (Mp-127)
                   162259276829213363391578010288127 (Mp-107)

                         618970019642690137449562111 (Mp- 89)
                                 2305843009213693951 (Mp- 61)
                                          2147483647 (Mp- 31)

(nearby Mersenne Primes included for frame of reference)

function cuberoot(__, _, ___, ____, _____,
                  ______, _______, ________) {

    ## Sources :
    ##
    ## [-1-] "Hacker*s Delight"
    ## [-2-] stackoverflow.com/questions/4331820/
    ##                        integer-cube-root/76684580
    ##
    ##  __| input :: (sign agnostic) 159-bit integer, string or numeric
    ##    |           range (inclusive) : [ 1 - (2^53 - 1)^3,
    ##    |                                     (2^53 - 1)^3 - 1 ]
    ##    |
    ##    |--> noncomplex("real") integer cube-root
    ##

    _______ = length(__ = (_ = "")__)

    if (_ = (___ = +__) < !___)
        return \
        -cuberoot(_______ < ++_^_^_ ? -___ : substr(__, _))

    else if (_ = (__ += ____ = _) <= (___ = (_ += ++_)^++_))
        return __ \
            ? (__ == ___) + _ \
            : +__

    _____ = __ < (___ = (_ += ++_) * (_ += _))^___ \
                     ? _ + _ * (__ >= ___^_) : \
            __ <  ___^(_ + _)                   \
                     ? _ *(_ - (__ <  ___^(_ + ___))) \
                     : _ + _ * _ + !!_

    if (--_ * (_ + _) < _______++) {

       if ((_______ = ___^(_____ = _______)) < __)
           while ((_______ *= ___ + !++_____) < __)_

       if ((_______ = _*(_+_)*_) <= ++_____)
           _____ = --_______
    }

    ________  = ___
     _______  = _ + (___ = !--_)
           _ += _

    do  __ < (______ += (______ = ________^_____) * \
              ((____ *= _) + (___ += ___)) * _______) ||

        __ -= ______ + ! (\
                ____ += ___ + ++___)

    while (_____--)

    return ___
}

    -723344818665451215712927172538123444058715062271

——> -8976667210094561

Despite not leveraging any big-int libraries, the answer is indeed spot on compared to the output of bc :

     8976667210094561.79249766353857714127663494315131
                      28935012202101375012266612716921 . . .

Upvotes: 0

jmd_dk

Reputation: 13110

Starting from the code within the GitHub gist from the answer of Fabian Giesen, I have arrived at the following, faster implementation:

static inline uint64_t icbrt(uint64_t x) {
  uint64_t b, y, one = 1, bits = 3*21;
  int s;
  for (s = bits - 3; s >= 0; s -= 3) {
    if ((x >> s) == 0)
      continue;
    x -= one << s;
    y = 1;
    for (s = s - 3; s >= 0; s -= 3) {
      y += y;
      b = 1 + 3*y*(y + 1);
      if ((x >> s) >= b) {
        x -= b << s;
        y += 1;
      }
    }
    return y;
  }
  return 0;
}

While the above is still somewhat slower than methods relying on the GNU specific __builtin_clzll, the above does not make use of compiler specifics and is thus completely portable.

The `bits` constant

Lowering the constant bits leads to faster computation, but the highest number x for which the function gives correct results is (1 << bits) - 1. Also, bits must be a multiple of 3 and be at most 64, meaning that its maximum value is really 3*21 == 63. With bits = 3*21, icbrt() thus works for input x <= 9223372036854775807. If we know that a program is working with limited x, say x < 1000000, then we can speed up the cube root computation by setting bits = 3*7, since (1 << 3*7) - 1 = 2097151 >= 1000000.

64-bit vs. 32-bit integers

Though the above is written for 64-bit integers, the logic is the same for 32-bit:

#include <stdint.h>

static inline uint32_t icbrt(uint32_t x) {
  uint32_t b, y, bits = 3*7;  /* or whatever is appropriate */
  int s;
  for (s = bits - 3; s >= 0; s -= 3) {
    if ((x >> s) == 0)
      continue;
    x -= 1 << s;
    y = 1;
    for (s = s - 3; s >= 0; s -= 3) {
      y += y;
      b = 1 + 3*y*(y + 1);
      if ((x >> s) >= b) {
        x -= b << s;
        y += 1;
      }
    }
    return y;
  }
  return 0;
}

Upvotes: 1

Michel

Reputation: 349

You can try and adapt this C algorithm :

#include <limits.h>

// return a number that, when multiplied by itself twice, makes N. 
unsigned cube_root(unsigned n){
    unsigned a = 0, b;
    for (int c = sizeof(unsigned) * CHAR_BIT / 3 * 3 ; c >= 0; c -= 3) {
        a <<= 1;
        b = a + (a << 1), b = b * a + b + 1 ;
        if (n >> c >= b)
            n -= b << c, ++a;
    }
    return a;
}

Also there is :

// return the number that was multiplied by itself to reach N.
unsigned square_root(const unsigned num) {
    unsigned a, b, c, d;
    for (b = a = num, c = 1; a >>= 1; ++c);
    for (c = 1 << (c & -2); c; c >>= 2) {
        d = a + c;
        a >>= 1;
        if (b >= d)
            b -= d, a += c;
    }
    return a;
}

Source

Upvotes: 0

Brett Hale

Reputation: 22328

I've adapted the algorithm presented in 1.5.2 (the kth root) in Modern Computer Arithmetic (Brent and Zimmerman). For the case of (k == 3), and given a 'relatively' accurate over-estimate of the initial guess - this algorithm seems to out-perform the 'Hacker's Delight' code above.

Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.

Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations in the HacDel code - with linear O(b) convergence, despite having a loop body that is evidently much faster.

The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)

static inline uint32_t u64_cbrt (uint64_t x)
{
    uint64_t r0 = 1, r1;

    /* IEEE-754 cbrt *may* not be exact. */

    if (x == 0) /* cbrt(0) : */
        return (0);

    int b = (64) - __builtin_clzll(x);
    r0 <<= (b + 2) / 3; /* ceil(b / 3) */

    do /* quadratic convergence: */
    {
        r1 = r0;
        r0 = (2 * r1 + x / (r1 * r1)) / 3;
    }
    while (r0 < r1);

    return ((uint32_t) r1); /* floor(cbrt(x)); */
}

A crbt call probably isn't all that useful - unlike the sqrt call which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.

The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.

It's interesting that Intel's 'Icelake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.

Upvotes: 5

P_P

Reputation: 787

// On my pc: Math.Sqrt 35 ns, cbrt64 <70ns, cbrt32 <25 ns, (cbrt12 < 10ns)

// cbrt64(ulong x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt     (acbrt1)

// cbrt32(uint x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/icbrt.c.txt     (icbrt1)

// Union in C#:
// http://www.hanselman.com/blog/UnionsOrAnEquivalentInCSairamasTipOfTheDay.aspx

using System.Runtime.InteropServices;  
[StructLayout(LayoutKind.Explicit)]  
public struct fu_32   // float <==> uint
{
[FieldOffset(0)]
public float f;
[FieldOffset(0)]
public uint u;
}

private static uint cbrt64(ulong x)
{
    if (x >= 18446724184312856125) return 2642245;
    float fx = (float)x;
    fu_32 fu32 = new fu_32();
    fu32.f = fx;
    uint uy = fu32.u / 4;
    uy += uy / 4;
    uy += uy / 16;
    uy += uy / 256;
    uy += 0x2a5137a0;
    fu32.u = uy;
    float fy = fu32.f;
    fy = 0.33333333f * (fx / (fy * fy) + 2.0f * fy);
    int y0 = (int)                                      
        (0.33333333f * (fx / (fy * fy) + 2.0f * fy));    
    uint y1 = (uint)y0;                                 

    ulong y2, y3;
    if (y1 >= 2642245)
    {
        y1 = 2642245;
        y2 = 6981458640025;
        y3 = 18446724184312856125;
    }
    else
    {
        y2 = (ulong)y1 * y1;
        y3 = y2 * y1;
    }
    if (y3 > x)
    {
        y1 -= 1;
        y2 -= 2 * y1 + 1;
        y3 -= 3 * y2 + 3 * y1 + 1;
        while (y3 > x)
        {
            y1 -= 1;
            y2 -= 2 * y1 + 1;
            y3 -= 3 * y2 + 3 * y1 + 1;
        }
        return y1;
    }
    do
    {
        y3 += 3 * y2 + 3 * y1 + 1;
        y2 += 2 * y1 + 1;
        y1 += 1;
    }
    while (y3 <= x);
    return y1 - 1;
}

private static uint cbrt32(uint x)
{
    uint y = 0, z = 0, b = 0;
    int s = x < 1u << 24 ? x < 1u << 12 ? x < 1u << 06 ? x < 1u << 03 ? 00 : 03 :
                                                         x < 1u << 09 ? 06 : 09 :
                                          x < 1u << 18 ? x < 1u << 15 ? 12 : 15 :
                                                         x < 1u << 21 ? 18 : 21 :
                           x >= 1u << 30 ? 30 : x < 1u << 27 ? 24 : 27;
    do
    {
        y *= 2;
        z *= 4;
        b = 3 * y + 3 * z + 1 << s;
        if (x >= b)
        {
            x -= b;
            z += 2 * y + 1;
            y += 1;
        }
        s -= 3;
    }
    while (s >= 0);
    return y;
}

private static uint cbrt12(uint x) // x < ~255
{
    uint y = 0, a = 0, b = 1, c = 0;
    while (a < x)
    {
        y++;
        b += c;
        a += b;
        c += 6;
    }
    if (a != x) y--;
    return y;
}

Upvotes: 2

Fabian Giesen

Reputation: 3311

The book "Hacker's Delight" has algorithms for this and many other problems. The code is online here. EDIT: That code doesn't work properly with 64-bit ints, and the instructions in the book on how to fix it for 64-bit are somewhat confusing. A proper 64-bit implementation (including test case) is online here.

I doubt that your squareroot function works "correctly" - it should be ulong a for the argument, not n :) (but the same approach would work using cbrt instead of sqrt, although not all C math libraries have cube root functions).

Upvotes: 12

Keith Randall

Reputation: 23265

If pow is too expensive, you can use a count-leading-zeros instruction to get an approximation to the result, then use a lookup table, then some Newton steps to finish it.

int k = __builtin_clz(n); // counts # of leading zeros (often a single assembly insn)
int b = 64 - k;           // # of bits in n
int top8 = n >> (b - 8);  // top 8 bits of n (top bit is always 1)
int approx = table[b][top8 & 0x7f];

Given b and top8, you can use a lookup table (in my code, 8K entries) to find a good approximation to cuberoot(n). Use some Newton steps (see comingstorm's answer) to finish it.

Upvotes: 4

comingstorm

Reputation: 26117

You could try a Newton's step to fix your rounding errors:

ulong r = (ulong)pow(n, 1.0/3);
if(r==0) return r; /* avoid divide by 0 later on */
ulong r3 = r*r*r;
ulong slope = 3*r*r;

ulong r1 = r+1;
ulong r13 = r1*r1*r1;

/* making sure to handle unsigned arithmetic correctly */
if(n >= r13) r+= (n - r3)/slope;
if(n < r3)   r-= (r3 - n)/slope;

A single Newton step ought to be enough, but you may have off-by-one (or possibly more?) errors. You can check/fix those using a final check&increment step, as in your OQ:

while(r*r*r > n) --r;
while((r+1)*(r+1)*(r+1) <= n) ++r;

or some such.

(I admit I'm lazy; the right way to do it is to carefully check to determine which (if any) of the check&increment things is actually necessary...)

Upvotes: 3

Karl Knechtel

Reputation: 61617

I would research how to do it by hand, and then translate that into a computer algorithm, working in base 2 rather than base 10.

We end up with an algorithm something like (pseudocode):

Find the largest n such that (1 << 3n) < input.
result = 1 << n.
For i in (n-1)..0:
    if ((result | 1 << i)**3) < input:
        result |= 1 << i.

We can optimize the calculation of (result | 1 << i)**3 by observing that the bitwise-or is equivalent to addition, refactoring to result**3 + 3 * i * result ** 2 + 3 * i ** 2 * result + i ** 3, caching the values of result**3 and result**2 between iterations, and using shifts instead of multiplication.

Upvotes: 0

Integer cube root

Answers (9)

The bits constant

64-bit vs. 32-bit integers

Related Questions

The `bits` constant