How to accomplish this byte munging in Perl?

Question

Background:

I'm trying to use the Perl script from here to decrypt an Android backup. Unfortunately, the checksum validation fails.

After playing around with this (Python) script, the problem seems to be that I need to do some additional munging of the master key (n.b. masterKeyJavaConversion in the Python script).

Problem:

I need to take a bag of bytes and perform the following conversion steps:

Sign-extend from signed char to signed short
Convert the result from UTF-16 (BE?) to UTF-8

For example (all bytes are in hex):

3x -> 3x
7x -> 7x
ax -> ef be ax
bx -> ef be bx
cx -> ef bf 8x
dx -> ef bf 9x
ex -> ef bf ax
fx -> ef bf bx

(The x always remains unchanged.)

More specifically, given a bit sequence 1abc defg, I need to output 1110 1111 1011 111a 10bc defg. (For 0abc defg, the output is just 0abc defg, i.e. unchanged.)

Answers may use UTF conversions or may do the bit twiddling directly; I don't care, as long as it works (this isn't performance-critical). Answers in the form of a subroutine are ideal. (My main problem is I know just enough Perl to be dangerous. If this was C/C++, I wouldn't need help, but it would be a major undertaking to rewrite the entire script in another language, or to modify the Python script to not need to read the entire input into memory.)

ikegami · Accepted Answer

1110 1111 1011 111a 10bc defg would be a valid UTF-8 encoding.

++++-------------------------- Start of three byte sequence
||||     ++------------------- Continuation byte
||||     ||       ++---------- Continuation byte
||||     ||       ||
11101111 1011111a 10bcdefg
    ||||   ||||||   ||||||
    ++++---++++++---++++++---- 1111 1111 1abc defg

That's just the extension of an 8-bit signed number to 16 bits, cast to unsigned, and treated as a Unicode Code Point.

So, without looking at the code, I think you want

sub encode_utf8 { 
   my ($s) = @_;
   utf8::encode($s);
   return $s;
}

sub munge {
   return
      encode_utf8                # "\x30\x70\xEF\xBE\xA0..."
         pack 'W*',              # "\x{0030}\x{0070}\x{FFA0}..."
            unpack 'S*',         # 0x0030, 0x0070, 0xFFA0, ...
               pack 's*',        # "\x30\x00\x70\x00\xA0\xFF..." (on a LE machine)
                  unpack 'c*',   # 48, 112, -96, ...
                     $_[0];      # "\x30\x70\xA0..."
}

my $s = "\x30\x70\xA0\xB0\xC0\xD0\xE0\xF0";
my $munged = munge($s);

If you remove the comments, you get the following:

sub munge {
   my $s = pack 'W*', unpack 'S*', pack 's*', unpack 'c*', $_[0];
   utf8::encode($s);
   return $s;
}
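To compare the result with the table in the question, it can help to dump the intermediate values and the output bytes as hex. The following is only a sketch: `munge` is repeated so the snippet runs on its own, and `hex_dump` is a hypothetical helper that is not part of the answer above.

```perl
use strict;
use warnings;

# Same pipeline as munge() above, repeated so this sketch is self-contained.
sub munge {
   my $s = pack 'W*', unpack 'S*', pack 's*', unpack 'c*', $_[0];
   utf8::encode($s);
   return $s;
}

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

# Intermediate values for the single byte 0xA0.
my ($signed)    = unpack 'c', "\xA0";             # signed char: -96
my ($codepoint) = unpack 'S', pack 's', $signed;  # sign-extended, then read as unsigned: 0xFFA0
printf "%d -> U+%04X -> %s\n", $signed, $codepoint, hex_dump(munge("\xA0"));
# prints: -96 -> U+FFA0 -> ef be a0

print hex_dump(munge("\x30\x70\xA0\xB0\xC0\xD0\xE0\xF0")), "\n";
# prints: 30 70 ef be a0 ef be b0 ef bf 80 ef bf 90 ef bf a0 ef bf b0
```

The second line of output matches the 3x, 7x, ax, bx, cx, dx, ex and fx rows of the table with x = 0.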

Here's a much faster solution that uses a precomputed 256-entry lookup table:

my @map = (
   ( map chr($_),            0x00..0x7F ),
   ( map "\xEF\xBE".chr($_), 0x80..0xBF ),
   ( map "\xEF\xBF".chr($_ - 0x40), 0xC0..0xFF ),   # cx..fx -> ef bf 8x..bx
);

sub munge { join '', @map[ unpack 'C*', $_[0] ] }
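Since the question also allows direct bit twiddling, here is a minimal sketch of that approach, written straight from the 1abc defg -> 1110 1111 1011 111a 10bc defg rule. The name `munge_bits` is an assumption for illustration. It is checked against the pack/unpack version (repeated here as `munge_pack` so the snippet is self-contained) for all 256 byte values.

```perl
use strict;
use warnings;

# The pack/unpack pipeline from earlier in this answer, used as the reference.
sub munge_pack {
   my $s = pack 'W*', unpack 'S*', pack 's*', unpack 'c*', $_[0];
   utf8::encode($s);
   return $s;
}

# Hypothetical bit-twiddling variant (not from the original answer).
sub munge_bits {
   my ($bytes) = @_;
   my $out = '';
   for my $byte (unpack 'C*', $bytes) {
      if ($byte < 0x80) {
         # 0abc defg is passed through unchanged.
         $out .= chr($byte);
      }
      else {
         # 1abc defg -> 1110 1111, 1011 111a, 10bc defg
         $out .= pack 'C3',
            0xEF,
            0xBE | (($byte >> 6) & 1),
            0x80 | ($byte & 0x3F);
      }
   }
   return $out;
}

# Compare both implementations on every possible input byte.
for my $byte (0x00 .. 0xFF) {
   my $in = chr($byte);
   die sprintf("mismatch for byte 0x%02X\n", $byte)
      unless munge_bits($in) eq munge_pack($in);
}
print "all 256 byte values agree\n";
```

The same loop can be pointed at any other implementation of the conversion to confirm that it reproduces the table in the question.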
