Reputation: 103774
It was my understanding that both POSIX awk and GNU awk use IEEE 754 double for both integer and floats. (I know the -M
switch is available on GNU awk for arbitrary precision integers. This question assumes without -M
selected...)
This means that the max size of integer result with awk / gawk / perl (those without AUTOMATIC promotion to arbitrary precision integers) would be 53 bits since this is the max size integer that can fit in a IEEE 754 double. (At magnitudes greater than 2^53, you can no longer expect ±1 to work as it would with an integer but floating point arithmetic still works within the limits of a IEEE double.)
It seems to be easily demonstrated.
These work as expected with correct results (to the last digit) on both awk and gawk:
$ gawk 'BEGIN{print 2**52-1}'
4503599627370495
$ gawk 'BEGIN{print 2**52+1}'
4503599627370497
$ gawk 'BEGIN{print 2**53-1}'
9007199254740991
This is off by 1 (and is what I would expect with 53 bit max integer):
$ gawk 'BEGIN{print 2**53+1}' # 9007199254740993 is the correct result
9007199254740992
But here is what I would NOT expect. With certain power of 2 values both awk and GNU awk perform integer arithmetic at far greater precision than is possible within 53 bits.
(On my system, /usr/bin/awk
is MacOS POSIX awk; gawk
is GNU awk.)
Consider these examples, all precise to the digit:
$ gawk 'BEGIN{print 2**230}' # float result with awk...
1725436586697640946858688965569256363112777243042596638790631055949824
$ /usr/bin/awk 'BEGIN{print 2**99}' # max that POSIX awk supports
633825300114114700748351602688
The precision of ±1 is not supported at these magnitudes but limited arithmetic operations with powers of 2 are supported. Again, precise to the digit:
$ /usr/bin/awk 'BEGIN{print 2**99-2**98}'
316912650057057350374175801344
$ /usr/bin/awk 'BEGIN{print 2**99+2**98}'
950737950171172051122527404032
$ gawk 'BEGIN{print 2**55-968}' # 2^55=36028797018963968
36028797018963000
I am speculating that awk and gawk have some sort of non standard way of recognizing that 2^N
is equivalent to 2<<N
and doing some limited math inside of that arena.
Any form of [integer > 2] ^ Y
with the result being greater than 2^53 has a drop in precision that is expected. ie, 10^15 is the rough max integer for ±1 to be accurate since 10^16 requires 54 bits.
$ gawk 'BEGIN{print 10**15+1}' # correct
1000000000000001
$ gawk 'BEGIN{print 10**16+1}' # not correct
10000000000000000
This is correct in magnitude for 10**64 but only precise for the first 16 digits (which I would expect):
$ gawk 'BEGIN{print 10**64}'
10000000000000001674705827425446886926697411428962669123675881472
# should be '1' + 64 '0'
# This is just a presentation issue of a value implying greater precision...
The GNU document is not exactly helpful since it speaks of the max values for 64 bit unsigned and signed integers implying those are used somehow. But it is easy to demonstrate that with the exception of powers of 2, the max integer on gawk is 2**53
Questions:
Am I correct that ALL integer calculations in awk / gawk are in fact IEEE doubles with max value of 2**53 for ±1? Is that documented somewhere?
If that is correct, what is happening with larger powers of 2?
(It would be nice if there were automatic switching to float format (the way Perl does) at that magnitude where there is a loss of precision btw.)
Upvotes: 1
Views: 688
Reputation: 2805
UPDATE : statement on powers of 10 in 754 double :
anectodally, even w/o bigint add-ons, if you're just looking for powers of 10 standalone, or using them to
mod ( % )
against powers of 2, it appears u can go up to 10^22.
jot 25 | gawk -be '$++NF = sprintf("%.f", 10^$1)' # same for mawk+nawk
{ … pruned smaller ones… }
13 10000000000000
14 100000000000000
15 1000000000000000
16 10000000000000000
17 100000000000000000
18 1000000000000000000
19 10000000000000000000
20 100000000000000000000
21 1000000000000000000000
22 10000000000000000000000 <---------
23 99999999999999991611392
24 999999999999999983222784
25 10000000000000000905969664
( 2 ^ x ) % ( 10 ^ 22 )
jot -w 'x=%d;y=22; print x,"=",y,"=",(2^x)%%(10^y),"\n"; ' - 3 1023 68 |
bc |
mawk2 '$++NF = sprintf("\f\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"\
"\b\b\b\b\b\b\b\b%.f",
( 2 ^ $1 ) % ( 10 ^ $2 ) )' FS== OFS== |
column -s= -t | gcat -n
All the powers of 2, even up to 1023, can directly obtain the exact modulo of 10 ^ 22
w/o needing specialized algorithms, string ops, or bigint packages, as confirmed by gnu-bc
1 3 22 8
8
2 71 22 2361183241434822606848
2361183241434822606848
3 139 22 2991196020261297061888
2991196020261297061888
4 207 22 1983204197482918576128
1983204197482918576128
5 275 22 7804340758912662765568
7804340758912662765568
6 343 22 3047019946172856926208
3047019946172856926208
7 411 22 729696200186866434048
729696200186866434048
8 479 22 3987839644142653145088
3987839644142653145088
9 547 22 1954520592778765795328
1954520592778765795328
10 615 22 5333860664072122400768
5333860664072122400768
11 683 22 3741475809259744657408
3741475809259744657408
12 751 22 1513586051123404341248
1513586051123404341248
13 819 22 2458937421726941708288
2458937421726941708288
14 887 22 8118143630040815894528
8118143630040815894528
15 955 22 3346307143437247315968
3346307143437247315968
16 1023 22 2417678164812112068608
2417678164812112068608
=============
not for all powers of 2. as you said, It's based on limitations of IEEE 754 double
format. The highest i can get it to spit out is 2^1023
.
2^1024
results in INF unless you invoke bignum mode -M
That said, gap begins to jump past 2^53, and increases along the way (as you go further into so-called "sub-optimal" range. As for printing those out, %d / %i is good for +/- 63-bits in gawk/mawk2
, and %u up to unsigned 64-bits int
(but potentially imprecise other than exact powers of 2 once you're past 2^53).
pre-mid-2020 mawk 1.3.4 appears to be limited to 31/32-bits instead respectively. A change has been made around then to use 64-bit ints for s/printf
caps.
Past those ranges, %.f is just about the only way to go.
Upvotes: 1
Reputation: 2805
one special case about powers of two -
if you want just 2^N-1 for powers up to 1023, a very clean sub() wil do the trick without having to physically go figure out what the last digit is :
sub(/[2468]$/, index("1:2:0:3", bits % 4), pow2str)
the last digit for positive integer powers of 2 have this repeat and predictable pattern when you take modulo against 4,
so using this specially crafted string, where the values exist at positions 7/3/1/5 (in descending modulo order), the string index itself is already exact last digit minus 1.
e.g. 2^719 : it goes 275. . . . 60288
| |
719 % 4 = 3, located at position 7 of reference string "1:2:0:3",
so the regex replaces the final "8" with a "7", giving you exactly 2^N-1 for any gigantic integer power of 2.
if you already know how many bits this power of 2 is supposed to be, then this way is faster, otherwise, a substring replacement approach is definitely faster than running it through logarithmic functions.
Upvotes: 0
Reputation: 222437
I cannot speak to the numeric implementations used in particular versions of gawk or awk. This answer speaks to floating-point generally, particularly IEEE-754 binary formats.
Computing 299 for 2**99
and 2230 for 2**230
are simply normal operations for floating-point arithmetic. Each is represented with a significand with one significant binary digit, 1, and an exponent of 99 or 230. Whatever routine is used to implement the exponentiation operation is presumably doing its job correctly. Since binary floating-point represents a number using a sign, a significand, and a scaling of two to some power, 299 and 2230 are easily represented.
When these numbers are printed, some routine is called to convert them to decimal numerals. This routine also appears to be well implemented, producing correct output. Some work is required to do that conversion correctly, as implementing it with naïve arithmetic will introduce rounding errors that produce incorrect results. (Sometimes little engineering effort is given to conversion routines and they produce results accurate only to a limited number of significant decimal digits. This appears to be less common; correctly rounded implementations are more common now than they used to be.)
Apparent “loss of precision,” more accurately called “loss of accuracy” or “rounding errors,” occurs when results cannot be exactly implemented (such as 253+1) or when floating-point operations are implemented without correct rounding. For 299 and 2230, no such loss is imposed by the floating-point format.
This means that the max size of integer result with awk / gawk / perl… would be 53 bits… ”
This is incorrect, or at least incorrectly phrased. The last consecutive integer that can be represented in IEEE-754 64-bit binary is 253. But it is certainly not the maximum. 253+2 can also be represented, having skipped 253+1. There are many more integers larger than 253 that can be represented.
Upvotes: 3