Fravadona

Reputation: 17270

decoding base64 encoded text with POSIX awk

I need to decode a lot of base64-encoded text strings in awk; because I don't want to massively fork a non-portable base64 binary, I wrote an awk function for doing the decoding:

function base64_decode(str,    out,i,n,v) {
    out = ""
    if ( ! ("A" in _BASE64_DECODE_c2i) )
        for (i = 1; i <= 64; i++)
            _BASE64_DECODE_c2i[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
    i = 0
    n = length(str)
    while (i <= n) {
        v = _BASE64_DECODE_c2i[substr(str,++i,1)] * 262144 + \
            _BASE64_DECODE_c2i[substr(str,++i,1)] * 4096 + \
            _BASE64_DECODE_c2i[substr(str,++i,1)] * 64 + \
            _BASE64_DECODE_c2i[substr(str,++i,1)]
        out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
    }
    return out
}

Which works fine:

printf '%s\n' SmFuZQ== amRvZQ== |
LANG=C command -p awk '
    { print base64_decode($0) }
    function base64_decode(...) {...} # placeholder for the real function
'
Jane
jdoe

CONTEXTUAL PROBLEM

I'm trying to get the givenName of the users that are members of GroupCode = 025496 from the output of ldapsearch -LLL -o ldif-wrap=no ... '(|(uid=*)(GroupCode=*))' uid givenName GroupCode memberUid, for example:

dn: uid=jsmith,ou=users,dc=example,dc=com
givenName: John
uid: jsmith

dn: uid=jdoe,ou=users,dc=example,dc=com
uid: jdoe
givenName:: SmFuZQ==

dn: cn=group1,ou=groups,dc=example,dc=com
GroupCode: 025496
memberUid:: amRvZQ==
memberUid: jsmith

Here is an awk script for doing so:

LANG=C command -p awk -F '\n' -v RS='' -v GroupCode=025496 '
    {
        delete attrs
        for (i = 2; i <= NF; i++) {
            match($i,/::? /)
            key = substr($i,1,RSTART-1)
            val = substr($i,RSTART+RLENGTH)
            if (RLENGTH == 3)
                val = base64_decode(val)
            attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
        }
        if ( /\nuid:/ )
            givenName[ attrs["uid"] ] = attrs["givenName"]
        else
            memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
    }
    END {
        n = split(memberUid[GroupCode],uid,SUBSEP)
        for ( i = 1; i <= n; i++ )
            print givenName[ uid[i] ]
    }

    function base64_decode(...) { ... } # placeholder for the real function
'

On BSD and Solaris the result is:

Jane
John

While on Linux it is:


John

I don't understand where the issue is. Is there something wrong with the base64_decode function and/or the code that uses it?

Upvotes: 2

Views: 573

Answers (3)

Fravadona

Reputation: 17270

This answer is for reference

Here's a working base64_decode function (thanks @MNejatAydin for pointing out the issue(s) in the original one):

function base64_decode(str,    out,bits,n,i,c1,c2,c3,c4) {
    out = ""

    # One-time initialization during the first execution
    if ( ! ("A" in _BASE64) )
        for (i = 1; i <= 64; i++)
            # The "_BASE64" array associates a character to its base64 index
            _BASE64[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1

    # Decoding the input string
    n = length(str)
    i = 0
    while ( i < n ) {
        c1 = substr(str, ++i, 1)
        c2 = substr(str, ++i, 1)
        c3 = substr(str, ++i, 1)
        c4 = substr(str, ++i, 1)

        bits = _BASE64[c1] * 262144 + _BASE64[c2] * 4096 + _BASE64[c3] * 64 + _BASE64[c4]

        if ( c4 != "=" )
            out = out sprintf("%c%c%c", bits/65536, bits/256, bits)
        else if ( c3 != "=" )
            out = out sprintf("%c%c", bits/65536, bits/256)
        else
            out = out sprintf("%c", bits/65536)
    }

    return out
}

WARNING: the function requires LANG=C

The function also doesn't check that the input is a valid base64 string; for that you can add a simple condition like:

match( str, "^([a-zA-Z/-9+]{4})*([a-zA-Z/-9+]{2}[a-zA-Z/-9+=]{2})?$" )
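Interval expressions such as {4} are not accepted by every awk (older mawk builds are the usual example), so here is a sketch of a similar check that avoids them. The names is_base64 and check_base64 are hypothetical, and the check is slightly stricter than the regex above: it rejects a "=" that is followed by a non-"=" character.

```shell
# check_base64: hypothetical wrapper that prints "valid" or "invalid"
check_base64() {
    LC_ALL=C awk -v str="$1" '
        # is_base64: returns 1 when s looks like padded base64, else 0
        function is_base64(s,    body) {
            if (length(s) % 4 != 0)         # input must come in groups of 4
                return 0
            body = s
            sub(/==?$/, "", body)           # drop one or two trailing "="
            return body !~ "[^A-Za-z0-9+/]" # the rest must be alphabet chars only
        }
        BEGIN { print (is_base64(str) ? "valid" : "invalid") }
    '
}

check_base64 'SmFuZQ=='
check_base64 'SmF'
```

Like the regex, this only checks the shape of the string; it does not verify that the unused bits in a padded group are zero.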

Interestingly, the code is 2x faster than base64decode.awk, but it's only 3x faster than forking the base64 binary from inside awk.


notes:

  1. In a base64 encoded string, 4 characters represent 3 bytes of data; the input has to be processed in groups of 4 characters.

  2. Multiplying and dividing an integer by a power of two is equivalent to doing bitwise left and right shift operations.

    • 262144 is 2^18, so N * 262144 is equivalent to N << 18
    • 4096 is 2^12, so N * 4096 is equivalent to N << 12
    • 64 is 2^6, so N * 64 is equivalent to N << 6
    • 65536 is 2^16, so N / 65536 (integer division) is equivalent to N >> 16
    • 256 is 2^8, so N / 256 (integer division) is equivalent to N >> 8
  3. What happens in printf "%c", N:

    N is first converted to an integer (if need be) and then, WITH LANG=C, the 8 least significant bits are used for the %c formatting.

  4. Handling of the potential padding at the end of the encoded string:

    • If the 4th char isn't = (i.e. there's no padding) then the result should be 3 bytes of data.

    • If the 4th char is = and the 3rd char isn't = then there are 2 bytes of data to decode.

    • If the fourth char is = and the third char is = then there's only one byte of data.
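To make the arithmetic in these notes concrete, here is a small sketch that decodes the single group SmFu step by step. The variable names (alphabet, group, b1, b2, b3, worked) are mine, and the explicit % 256 is an assumption-free way of doing by hand what note 3 says %c does under LANG=C:

```shell
worked=$(LC_ALL=C awk 'BEGIN {
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    group = "SmFu"

    # Pack four 6-bit indexes into one 24-bit number:
    # multiplying by 64 each round is the same as shifting left by 6 bits.
    bits = 0
    for (i = 1; i <= 4; i++)
        bits = bits * 64 + index(alphabet, substr(group, i, 1)) - 1

    # Split the 24 bits back into three bytes (shift right, keep low 8 bits).
    b1 = int(bits / 65536) % 256
    b2 = int(bits / 256) % 256
    b3 = bits % 256

    printf "%d -> %d %d %d -> %c%c%c\n", bits, b1, b2, b3, b1, b2, b3
}')
printf '%s\n' "$worked"
# prints: 4874606 -> 74 97 110 -> Jan
```

The indexes of S, m, F and u are 18, 38, 5 and 46, so bits is 18*262144 + 38*4096 + 5*64 + 46 = 4874606.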


Upvotes: 1

M. Nejat Aydin

Reputation: 10133

Your function generates NUL bytes when its argument (encoded string) ends with padding characters (=s). Below is a corrected version of your while loop:

while (i < n) {
    v = _BASE64_DECODE_c2i[substr(str,1+i,1)] * 262144 + \
        _BASE64_DECODE_c2i[substr(str,2+i,1)] * 4096 + \
        _BASE64_DECODE_c2i[substr(str,3+i,1)] * 64 + \
        _BASE64_DECODE_c2i[substr(str,4+i,1)]
    i += 4
    if (v%256 != 0)
        out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
    else if (int(v/256)%256 != 0)
        out = out sprintf("%c%c", int(v/65536), int(v/256))
    else
        out = out sprintf("%c", int(v/65536))
}

Note that if the decoded bytes contain an embedded NUL then this approach may not work properly.
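If you want to see the stray NUL bytes for yourself, one way (a sketch, assuming od and tr are available) is to run the original function from the question and dump its raw output. The wrapper name decode_original and the variable stripped are hypothetical; what od shows depends on the awk implementation, since some of them cannot carry NUL bytes inside strings at all.

```shell
# decode_original: hypothetical wrapper around the function from the question
decode_original() {
    printf '%s\n' "$1" | LC_ALL=C awk '
        { printf "%s", base64_decode($0) }

        # Unmodified function from the question
        function base64_decode(str,    out,i,n,v) {
            out = ""
            if ( ! ("A" in _BASE64_DECODE_c2i) )
                for (i = 1; i <= 64; i++)
                    _BASE64_DECODE_c2i[substr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",i,1)] = i-1
            i = 0
            n = length(str)
            while (i <= n) {
                v = _BASE64_DECODE_c2i[substr(str,++i,1)] * 262144 + \
                    _BASE64_DECODE_c2i[substr(str,++i,1)] * 4096 + \
                    _BASE64_DECODE_c2i[substr(str,++i,1)] * 64 + \
                    _BASE64_DECODE_c2i[substr(str,++i,1)]
                out = out sprintf("%c%c%c", int(v/65536), int(v/256), v)
            }
            return out
        }
    '
}

# Byte-by-byte dump: any \0 after "Jane" is padding decoded as data
decode_original 'SmFuZQ==' | od -c

# With the NUL bytes removed, only the real payload is left
stripped=$(decode_original 'SmFuZQ==' | tr -d '\000')
printf '%s\n' "$stripped"
```

Trailing NULs like these would also explain the empty line in the question: a uid decoded with extra bytes appended no longer matches the key stored in givenName.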

Upvotes: 6

anubhava

Reputation: 786031

The problem is within the base64_decode function, which outputs some junk characters on gnu-awk.

As an alternative, you can use this awk code, which relies on the system-provided base64 utility:

{
   delete attrs
   for (i = 2; i <= NF; i++) {
      match($i,/::? /)
      key = substr($i,1,RSTART-1)
      val = substr($i,RSTART+RLENGTH)
      if (RLENGTH == 3) {
         cmd = "echo " val " | base64 -di"
         cmd | getline val   # should also check exit code here
         close(cmd)          # close the pipe so descriptors are not leaked and the same command can run again
      }
      attrs[key] = ((key in attrs) ? attrs[key] SUBSEP val : val)
   }
   if ( /\nuid:/ )
      givenName[ attrs["uid"] ] = attrs["givenName"]
   else
      memberUid[ attrs["GroupCode"] ] = attrs["memberUid"]
}
END {
   n = split(memberUid[GroupCode],uid,SUBSEP)
   for ( i = 1; i <= n; i++ )
      print givenName[ uid[i] ]
}

I have tested this on the gnu and BSD awk versions and I get the expected output in all cases.

If you cannot use an external base64 utility, then I suggest you take a look here for an awk version of base64 decode.

Upvotes: 3
