dawg
dawg

Reputation: 103774

awk and gawk with large integers and large powers of 2

It was my understanding that both POSIX awk and GNU awk use IEEE 754 double for both integer and floats. (I know the -M switch is available on GNU awk for arbitrary precision integers. This question assumes without -M selected...)

This means that the max size of integer result with awk / gawk / perl (those without AUTOMATIC promotion to arbitrary precision integers) would be 53 bits since this is the max size integer that can fit in a IEEE 754 double. (At magnitudes greater than 2^53, you can no longer expect ±1 to work as it would with an integer but floating point arithmetic still works within the limits of a IEEE double.)

It seems to be easily demonstrated.

These work as expected with correct results (to the last digit) on both awk and gawk:

$ gawk 'BEGIN{print 2**52-1}'
4503599627370495
$ gawk 'BEGIN{print 2**52+1}'
4503599627370497
$ gawk 'BEGIN{print 2**53-1}'
9007199254740991

This is off by 1 (and is what I would expect with 53 bit max integer):

$ gawk 'BEGIN{print 2**53+1}'      # 9007199254740993 is the correct result
9007199254740992   

But here is what I would NOT expect. With certain power of 2 values both awk and GNU awk perform integer arithmetic at far greater precision than is possible within 53 bits.

(On my system, /usr/bin/awk is MacOS POSIX awk; gawk is GNU awk.)

Consider these examples, all precise to the digit:

$ gawk 'BEGIN{print 2**230}'  # float result with awk...
1725436586697640946858688965569256363112777243042596638790631055949824

$ /usr/bin/awk 'BEGIN{print 2**99}'   # max that POSIX awk supports
633825300114114700748351602688

The precision of ±1 is not supported at these magnitudes but limited arithmetic operations with powers of 2 are supported. Again, precise to the digit:

$ /usr/bin/awk 'BEGIN{print 2**99-2**98}'
316912650057057350374175801344    

$ /usr/bin/awk 'BEGIN{print 2**99+2**98}'
950737950171172051122527404032

$ gawk 'BEGIN{print 2**55-968}'  # 2^55=36028797018963968
36028797018963000

I am speculating that awk and gawk have some sort of non standard way of recognizing that 2^N is equivalent to 2<<N and doing some limited math inside of that arena.

Any form of [integer > 2] ^ Y with the result being greater than 2^53 has a drop in precision that is expected. ie, 10^15 is the rough max integer for ±1 to be accurate since 10^16 requires 54 bits.

$ gawk 'BEGIN{print 10**15+1}'  # correct
1000000000000001

$ gawk 'BEGIN{print 10**16+1}'  # not correct
10000000000000000

This is correct in magnitude for 10**64 but only precise for the first 16 digits (which I would expect):

$ gawk 'BEGIN{print 10**64}'
10000000000000001674705827425446886926697411428962669123675881472
# should be '1' + 64 '0'
# This is just a presentation issue of a value implying greater precision... 

The GNU document is not exactly helpful since it speaks of the max values for 64 bit unsigned and signed integers implying those are used somehow. But it is easy to demonstrate that with the exception of powers of 2, the max integer on gawk is 2**53

Questions:

  1. Am I correct that ALL integer calculations in awk / gawk are in fact IEEE doubles with max value of 2**53 for ±1? Is that documented somewhere?

  2. If that is correct, what is happening with larger powers of 2?

(It would be nice if there were automatic switching to float format (the way Perl does) at that magnitude where there is a loss of precision btw.)

Upvotes: 1

Views: 688

Answers (3)

RARE Kpop Manifesto
RARE Kpop Manifesto

Reputation: 2805

UPDATE : statement on powers of 10 in 754 double :

anectodally, even w/o bigint add-ons, if you're just looking for powers of 10 standalone, or using them to mod ( % ) against powers of 2, it appears u can go up to 10^22.

jot 25 | gawk -be '$++NF = sprintf("%.f", 10^$1)' # same for mawk+nawk

{ … pruned smaller ones… }

13  10000000000000
14  100000000000000
15  1000000000000000
16  10000000000000000

17  100000000000000000
18  1000000000000000000
19  10000000000000000000
20  100000000000000000000

21  1000000000000000000000
22  10000000000000000000000 <---------

23  99999999999999991611392
24  999999999999999983222784
25  10000000000000000905969664

( 2 ^ x ) % ( 10 ^ 22 )

   jot -w 'x=%d;y=22; print x,"=",y,"=",(2^x)%%(10^y),"\n"; ' - 3 1023 68 | 
   bc      | 
   mawk2 '$++NF = sprintf("\f\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"\
                                        "\b\b\b\b\b\b\b\b%.f",
                   ( 2 ^ $1 ) % ( 10 ^ $2 ) )'    FS== OFS== |
   column -s= -t | gcat -n

All the powers of 2, even up to 1023, can directly obtain the exact modulo of 10 ^ 22 w/o needing specialized algorithms, string ops, or bigint packages, as confirmed by gnu-bc

 1  3     22  8                       
              8

 2  71    22  2361183241434822606848  
              2361183241434822606848

 3  139   22  2991196020261297061888  
              2991196020261297061888

 4  207   22  1983204197482918576128  
              1983204197482918576128

 5  275   22  7804340758912662765568  
              7804340758912662765568

 6  343   22  3047019946172856926208  
              3047019946172856926208

 7  411   22  729696200186866434048   
              729696200186866434048

 8  479   22  3987839644142653145088  
              3987839644142653145088

 9  547   22  1954520592778765795328  
              1954520592778765795328

10  615   22  5333860664072122400768  
              5333860664072122400768

11  683   22  3741475809259744657408  
              3741475809259744657408

12  751   22  1513586051123404341248  
              1513586051123404341248

13  819   22  2458937421726941708288  
              2458937421726941708288

14  887   22  8118143630040815894528  
              8118143630040815894528

15  955   22  3346307143437247315968  
              3346307143437247315968

16  1023  22  2417678164812112068608  
              2417678164812112068608

=============

not for all powers of 2. as you said, It's based on limitations of IEEE 754 double format. The highest i can get it to spit out is 2^1023.

2^1024 results in INF unless you invoke bignum mode -M

That said, gap begins to jump past 2^53, and increases along the way (as you go further into so-called "sub-optimal" range. As for printing those out, %d / %i is good for +/- 63-bits in gawk/mawk2, and %u up to unsigned 64-bits int (but potentially imprecise other than exact powers of 2 once you're past 2^53).

pre-mid-2020 mawk 1.3.4 appears to be limited to 31/32-bits instead respectively. A change has been made around then to use 64-bit ints for s/printf caps.

Past those ranges, %.f is just about the only way to go.

Upvotes: 1

RARE Kpop Manifesto
RARE Kpop Manifesto

Reputation: 2805

one special case about powers of two -

if you want just 2^N-1 for powers up to 1023, a very clean sub() wil do the trick without having to physically go figure out what the last digit is :

sub(/[2468]$/, index("1:2:0:3", bits % 4), pow2str) 

the last digit for positive integer powers of 2 have this repeat and predictable pattern when you take modulo against 4,

so using this specially crafted string, where the values exist at positions 7/3/1/5 (in descending modulo order), the string index itself is already exact last digit minus 1.

e.g. 2^719 : it goes 275. . . . 60288
                                    |                         |
719 % 4 = 3, located at position 7 of reference string "1:2:0:3",

so the regex replaces the final "8" with a "7", giving you exactly 2^N-1 for any gigantic integer power of 2.

if you already know how many bits this power of 2 is supposed to be, then this way is faster, otherwise, a substring replacement approach is definitely faster than running it through logarithmic functions.

Upvotes: 0

Eric Postpischil
Eric Postpischil

Reputation: 222437

I cannot speak to the numeric implementations used in particular versions of gawk or awk. This answer speaks to floating-point generally, particularly IEEE-754 binary formats.

Computing 299 for 2**99 and 2230 for 2**230 are simply normal operations for floating-point arithmetic. Each is represented with a significand with one significant binary digit, 1, and an exponent of 99 or 230. Whatever routine is used to implement the exponentiation operation is presumably doing its job correctly. Since binary floating-point represents a number using a sign, a significand, and a scaling of two to some power, 299 and 2230 are easily represented.

When these numbers are printed, some routine is called to convert them to decimal numerals. This routine also appears to be well implemented, producing correct output. Some work is required to do that conversion correctly, as implementing it with naïve arithmetic will introduce rounding errors that produce incorrect results. (Sometimes little engineering effort is given to conversion routines and they produce results accurate only to a limited number of significant decimal digits. This appears to be less common; correctly rounded implementations are more common now than they used to be.)

Apparent “loss of precision,” more accurately called “loss of accuracy” or “rounding errors,” occurs when results cannot be exactly implemented (such as 253+1) or when floating-point operations are implemented without correct rounding. For 299 and 2230, no such loss is imposed by the floating-point format.

This means that the max size of integer result with awk / gawk / perl… would be 53 bits… ”

This is incorrect, or at least incorrectly phrased. The last consecutive integer that can be represented in IEEE-754 64-bit binary is 253. But it is certainly not the maximum. 253+2 can also be represented, having skipped 253+1. There are many more integers larger than 253 that can be represented.

Upvotes: 3

Related Questions