Reputation: 17270
I need to decode a lot of base64-encoded text strings in awk
; because I don't want to massively fork a non-portable base64
binary, I wrote an awk
function for doing the decoding:
function base64_decode(str, out,i,n,v) {
out = ""
if ( ! ("A" in _BASE64_DECODE_c2i) )
for (i = 1; i <= 64; i++)
_BASE64_DECODE_c2i[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
i = 0
n = length(str)
while (i <= n) {
v = _BASE64_DECODE_c2i[substr(str,++i,1)] * 262144 + \
_BASE64_DECODE_c2i[substr(str,++i,1)] * 4096 + \
_BASE64_DECODE_c2i[substr(str,++i,1)] * 64 + \
_BASE64_DECODE_c2i[substr(str,++i,1)]
out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
}
return out
}
Which works fine:
printf '%s\n' SmFuZQ== amRvZQ== |
LANG=C command -p awk '
{ print base64_decode($0) }
function base64_decode(...) {...} # placeholder for the real function
'
Jane
jdoe
I' trying to get the givenName
of the users that are members of GroupCode
= 025496
from the output of ldapsearch -LLL -o ldif-wrap=no ... '(|(uid=*)(GroupCode=*))' uid givenName GroupCode memberUid
, for example:
dn: uid=jsmith,ou=users,dc=example,dc=com
givenName: John
uid: jsmith
dn: uid=jdoe,ou=users,dc=example,dc=com
uid: jdoe
givenName:: SmFuZQ==
dn: cn=group1,ou=groups,dc=example,dc=com
GroupCode: 025496
memberUid:: amRvZQ==
memberUid: jsmith
Here would be an awk
for doing so:
LANG=C command -p awk -F '\n' -v RS='' -v GroupCode=025496 '
{
delete attrs
for (i = 2; i <= NF; i++) {
match($i,/::? /)
key = substr($i,1,RSTART-1)
val = substr($i,RSTART+RLENGTH)
if (RLENGTH == 3)
val = base64_decode(val)
attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
}
if ( /\nuid:/ )
givenName[ attrs["uid"] ] = attrs["givenName"]
else
memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
}
END {
n = split(memberUid[GroupCode],uid,SUBSEP)
for ( i = 1; i <= n; i++ )
print givenName[ uid[i] ]
}
function base64_decode(...) { ... } # placeholder for the real function
'
On BSD and Solaris the result is:
Jane
John
While on Linux it is:
John
I don't understand where the issue is. Is there something wrong with the base64_decode
function and/or the code that uses it?
Upvotes: 2
Views: 573
Reputation: 17270
This answer is for reference
Here's a working base64_decode
function (thanks @MNejatAydin for pointing out the issue(s) in the original one):
function base64_decode(str, out,bits,n,i,c1,c2,c3,c4) {
out = ""
# One-time initialization during the first execution
if ( ! ("A" in _BASE64) )
for (i = 1; i <= 64; i++)
# The "_BASE64" array associates a character to its base64 index
_BASE64[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
# Decoding the input string
n = length(str)
i = 0
while ( i < n ) {
c1 = substr(str, ++i, 1)
c2 = substr(str, ++i, 1)
c3 = substr(str, ++i, 1)
c4 = substr(str, ++i, 1)
bits = _BASE64[c1] * 262144 + _BASE64[c2] * 4096 + _BASE64[c3] * 64 + _BASE64[c4]
if ( c4 != "=" )
out = out sprintf("%c%c%c", bits/65536, bits/256, bits)
else if ( c3 != "=" )
out = out sprintf("%c%c", bits/65536, bits/256)
else
out = out sprintf("%c", bits/65536)
}
return out
}
WARNING: the function requires LANG=C
It also doesn't check that the input is a valid base64 string; for that you can add a simple condition like:
match( str, "^([a-zA-Z/-9+]{4})*([a-zA-Z/-9+]{2}[a-zA-Z/-9+=]{2})?$" )
Interestingly, the code is 2x faster than base64decode.awk, but it's only 3x faster than forking the base64
binary from inside awk
.
notes:
In a base64
encoded string, 4 bytes represent 3 bytes of data; the input have to be processed by groups of 4 characters.
Multiplying and dividing an integer by a power of two is equivalent to do bitwise left and right shifts operations.
262144
is 2^18
, so N * 262144
is equivalent to N << 18
4096
is 2^12
, so N * 4096
is equivalent to N << 12
64
id 2^6
, so N * 4096
is equivalent to N << 6
65536
is 2^16
, so N / 65536
(integer division) is equivalent to N >> 16
256
is 2^8
, so N / 256
(integer division) is equivalent to N >> 8
What happens in printf "%c", N
:
N
is first converted to an integer (if need be) and then, WITH LANG=C
, the 8 least significant bits are taken in for the %c
formatting.
Handling of the potential padding at the end of the encoded string:
If the 4th char isn't =
(i.e. there's no padding) then the result should be 3 bytes of data.
If the 4th char is =
and the 3rd char isn't =
then there's 2 bytes of of data to decode.
If the fourth char is =
and the third char is =
then there's only one byte of data.
Upvotes: 1
Reputation: 10133
Your function generates NUL bytes when its argument (encoded string) ends with padding characters (=
s). Below is a corrected version of your while
loop:
while (i < n) {
v = _BASE64_DECODE_c2i[substr(str,1+i,1)] * 262144 + \
_BASE64_DECODE_c2i[substr(str,2+i,1)] * 4096 + \
_BASE64_DECODE_c2i[substr(str,3+i,1)] * 64 + \
_BASE64_DECODE_c2i[substr(str,4+i,1)]
i += 4
if (v%256 != 0)
out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
else if (int(v/256)%256 != 0)
out = out sprintf("%c%c", int(v/65536), int(v/256))
else
out = out sprintf("%c", int(v/65536))
}
Note that if the decoded bytes contains an embedded NUL then this approach may not work properly.
Upvotes: 6
Reputation: 786031
Problem is within base64_decode
function that outputs some junk characters on gnu-awk.
You can use this awk code that uses system provided base64
utility as an alternative:
{
delete attrs
for (i = 2; i <= NF; i++) {
match($i,/::? /)
key = substr($i,1,RSTART-1)
val = substr($i,RSTART+RLENGTH)
if (RLENGTH == 3) {
cmd = "echo " val " | base64 -di"
cmd | getline val # should also check exit code here
}
attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
}
if ( /\nuid:/ )
givenName[ attrs["uid"] ] = attrs["givenName"]
else
memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
}
END {
n = split(memberUid[GroupCode],uid,SUBSEP)
for ( i = 1; i <= n; i++ )
print givenName[ uid[i] ]
}
I have tested this on gnu and BSD awk versions and I am getting expected output in all the cases.
If you cannot use external base64
utility then I suggest you take a look here for awk version of base64 decode.
Upvotes: 3