tampe125

Reputation: 8543

XOR on Unicode strings in Python 2.7

I am trying to decode an obfuscated Android app. After decompiling it, I can see that several strings are obfuscated in this way:

static char[] java_decode(char[] cArr, char[] cArr2) {
    int i = 0;
    for (int i2 = 0; i2 < cArr.length; i2++) {
        cArr[i2] = (char) (cArr2[i] ^ cArr[i2]);
        i++;
        if (i >= cArr2.length) {
            i = 0;
        }
    }
    return cArr;
}
str2 = new String(epgwmgrwgrdvzck("浶㫻ᒍ夓䌎箜湛泰Ⳮ䯣倝".toCharArray(), new char[]{'浌', '㫛', 'ᓮ', '奼', '䍻', '篲', '港', '沂', 'Ⲕ', '䯃', '倧'})).intern();
// java.lang.String str2 = ": country :"

For a better understanding and a quicker review, I'd like to replace all those obfuscated strings with their plain-text equivalents; I chose Python since it's well suited for a quick one-off script.
Sadly, I'm having a hard time with those multi-byte characters. This is the function I tried to write:

# coding=utf-8

def decode(string1, string2):
    string1 = list(string1)

    i = 0
    i2 = 0

    while i2 < len(string1):
        string1[i2] = chr(ord(string2[i]) ^ ord(string1[i2]))

        i += 1

        if i >= len(string2):
            i = 0

        i2 += 1

    string1 = str("".join(string1))    
    print string1

    return string1

decode("浶㫻ᒍ夓䌎箜湛泰Ⳮ䯣倝", ['浌', '㫛', 'ᓮ', '奼', '䍻', '篲', '港', '沂', 'Ⲕ', '䯃', '倧'])
# TypeError: ord() expected a character, but string of length 3 found

The main problem here is that ord() only accepts one character at a time, while those strings are made of multi-byte characters.
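
A quick check in the interpreter (Python 2, with the same UTF-8 source encoding, using one of the key characters as an example) seems to confirm this:

s = '浌'      # without the u prefix this is a plain str: three UTF-8 bytes
print len(s)  # 3 -> ord(s) raises "ord() expected a character, but string of length 3 found"

u = u'浌'     # with the u prefix it is a unicode object: a single code point
print len(u)  # 1
print ord(u)  # an int, as ord() expects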
Any suggestions on how to work around this issue?

I'm using Python 2.7.11 |Anaconda 4.0.0 (x86_64). I know Python 3 has far better Unicode support than Python 2; if the solution only works in Python 3, that's fine too, as it's just a one-time script.

Upvotes: 1

Views: 1177

Answers (1)

Your code works as-is in Python 3 (except that you need to change print string1 to print(string1)); the output and return value is the string : country :.

However, it doesn't work in Python 2, because there the string literals aren't unicode: you'd need to prefix every string literal with u (e.g. u'浌'), or alternatively use from __future__ import unicode_literals so that plain '...' literals create unicode strings in Python 2. Also, chr converts a value into an 8-bit string (i.e. a single byte), not a unicode character; for that you need unichr.
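
Applied to the code in the question, a minimal Python 2 version of the original loop would look something like this (the only real changes are the __future__ import and unichr in place of chr):

# coding=utf-8
from __future__ import unicode_literals  # make plain '...' literals unicode in Python 2

def decode(string1, string2):
    string1 = list(string1)

    i = 0
    i2 = 0

    while i2 < len(string1):
        # unichr() builds a unicode character; chr() would build a single byte
        string1[i2] = unichr(ord(string2[i]) ^ ord(string1[i2]))

        i += 1

        if i >= len(string2):
            i = 0

        i2 += 1

    return ''.join(string1)

print(decode("浶㫻ᒍ夓䌎箜湛泰Ⳮ䯣倝",
             ['浌', '㫛', 'ᓮ', '奼', '䍻', '篲', '港', '沂', 'Ⲕ', '䯃', '倧']))
# prints ": country :"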


FWIW, the code can be written more concisely in Python 3 as

from itertools import cycle

def decode(s1, s2):
    return ''.join([
        chr(ord(c1) ^ ord(c2))
        for c1, c2 in
        zip(s1, cycle(s2))
    ])

result = decode("浶㫻ᒍ夓䌎箜湛泰Ⳮ䯣倝",
                ['浌', '㫛', 'ᓮ', '奼', '䍻', '篲', '港', '沂', 'Ⲕ', '䯃', '倧'])

print(result)  # prints ": country :"

First of all, it seems that the Java code allows the second array to be shorter than the first, in which case the key is reused from the beginning; in Python we can use itertools.cycle to achieve this effect more cleanly. We use zip to pair corresponding values from the two inputs, and a list comprehension to build the list that is passed to ''.join.
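
As a small illustration of how zip and cycle pair the input with the key (using made-up short inputs):

from itertools import cycle

# cycle() repeats the shorter "key" forever; zip() stops when the first
# input runs out, so every character gets paired with a key character
pairs = list(zip('abcde', cycle('xy')))
print(pairs)  # [('a', 'x'), ('b', 'y'), ('c', 'x'), ('d', 'y'), ('e', 'x')]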

The same code works in Python 2 with minor modifications: add from __future__ import unicode_literals and change chr to unichr:

from __future__ import unicode_literals, print_function
from itertools import cycle

def decode(s1, s2):
    return ''.join([
        unichr(ord(c1) ^ ord(c2))
        for c1, c2 in
        zip(s1, cycle(s2))
    ])

result = decode("浶㫻ᒍ夓䌎箜湛泰Ⳮ䯣倝",
                ['浌', '㫛', 'ᓮ', '奼', '䍻', '篲', '港', '沂', 'Ⲕ', '䯃', '倧'])

print(result)  # prints ": country :"

Upvotes: 2
