How to compress text length using Unicode and bases, and be able to revert it back?

Question

There are 44 characters in this text: the quick brown fox jumped over the lazy dog

The same text can be represented with just 11 Unicode characters: 񜥎񐟾𴬔񇒉𚫔𮹂𓻣񥯨񜥎𵁼񽤙
(These characters look the same as "[]" but they are all different characters!)

This works because the text only uses the 27 characters in the set abcdefghijklmnopqrstuvwxyz (the lowercase letters plus a space), so each character can be treated as a base-27 digit with an index from 0 to 26. Unicode code points range from 0 to 1114111 (U+10FFFF), which means you can pack several of these digits into one larger number with some index arithmetic.

For example, the text this becomes [19, 7, 8, 18] if you convert each character to its index in the base-27 character set above. If you do the calculation below:

19 x 27 ^ 0 +  
7  x 27 ^ 1 +  
8  x 27 ^ 2 +  
18 x 27 ^ 3 = 360334

This gives a unique number, 360334, which is within the valid code point range (0 to 1114111), so chr(360334) returns the Unicode character 񗾎. To go back, ord('񗾎') returns 360334, which you can repeatedly divmod by 27 to recover the original indices, as shown below:

360334 %  27 = 19
360334 // 27 = 13345
13345  %  27 = 7
13345  // 27 = 494
494    %  27 = 8
494    // 27 = 18
18     %  27 = 18
18     // 27 = 0 BREAK
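
The arithmetic above can be sketched for a single group of characters as follows. This is only an illustration: pack and unpack are made-up helper names, and the sketch does not check the code point limit.

```python
CHARSET = 'abcdefghijklmnopqrstuvwxyz '  # the 27-character set used above
BASE = len(CHARSET)                      # 27

def pack(word):
    """Combine the indices of `word` into one integer, lowest position first."""
    return sum(CHARSET.index(c) * BASE ** i for i, c in enumerate(word))

def unpack(number):
    """Peel the indices back off with divmod, lowest position first."""
    out = ''
    while number:
        number, index = divmod(number, BASE)
        out += CHARSET[index]
    return out

code_point = pack('this')
print(code_point)                    # 360334
print(unpack(ord(chr(code_point))))  # this
```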

The question is: how can I turn this into a pair of convert and revert functions in Python?

Here is my attempt:

def power_sum(values, base, offset = 0):
    return sum(value * base ** (index + offset) for index, value in enumerate(values))

def convert_text(text, chars):
    base = len(chars)
    chars =  {char : index for index, char in enumerate(chars)}
    temp = []
    result = ''
    for index, char in enumerate(text):
        value = chars[char] # KeyError = that char is missing from the char set
        if power_sum(temp, base, 1) + value > 0x10FFFF: # U+10FFFF is max unicode code point
            result += chr(power_sum(temp, base))
            temp = [value]
        else:
            temp.append(value)
    result += chr(power_sum(temp, base))
    return result
    
def revert_text(text, chars):
    base = len(chars)
    chars = list(chars)
    result = ''
    for char in text:
        value = ord(char)
        while value:
            result += chars[int(value % base)]
            value //= base
    return result

chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars), end = '\n\n')

texts = [
    'this',
    'the quick brown fox jumped over the lazy dog',
    'china'
]

for text in texts:
    print('Start text ({}): {}'.format(len(text), text))
    
    text = convert_text(text, chars)
    print('Unicode text ({}): {}'.format(len(text), text))
    
    text = revert_text(text, chars)
    print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')

Output:

Base: 27

Start text (4): this
Unicode text (1): 񗾎
Revert text (4): this

Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (11): 񽭂늺񒂴񿙳򁈌񊖞񇻉񿿸񽭂񷲄🖛
Revert text (44): the quick brown fox jumped over the lazy dog

Start text (5): china
Unicode text (2): 𿼎
Revert text (4): chin

It fails with the string china for some reason.

1m7 · Accepted Answer

Thanks to chrslg's answer for showing that any number of trailing index-0 characters (here, a) contribute 0 to the sum, so they are lost when reverting. I fixed the code by shifting every character's index up by 1, which effectively turns the 27-character set into base 28 with digit 0 left unused.
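
To see why the shift matters, here is a small sketch. It ignores the code point limit and packs each word as a single number, and the dictionary and variable names are only illustrative:

```python
def power_sum(values, base):
    return sum(v * base ** i for i, v in enumerate(values))

chars = 'abcdefghijklmnopqrstuvwxyz '

# Zero-based indices: 'a' is 0, so a trailing 'a' adds nothing to the sum.
zero_based = {c: i for i, c in enumerate(chars)}
chin_0 = power_sum([zero_based[c] for c in 'chin'], 27)
china_0 = power_sum([zero_based[c] for c in 'china'], 27)
print(chin_0 == china_0)   # True: 'chin' and 'china' collide

# One-based indices in base 28: every character is a nonzero digit.
one_based = {c: i + 1 for i, c in enumerate(chars)}
chin_1 = power_sum([one_based[c] for c in 'chin'], 28)
china_1 = power_sum([one_based[c] for c in 'china'], 28)
print(chin_1 == china_1)   # False: the trailing 'a' is preserved
```

In the original attempt the same effect shows up when a lone a ends up in its own group: it packs to 0, and the while value: loop in revert_text emits nothing for it.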

def power_sum(values, base, offset = 0):
    return sum(value * base ** (index + offset) for index, value in enumerate(values))

def convert_text(text, chars):
    base = len(chars) + 1
    chars =  {char : index + 1 for index, char in enumerate(chars)}
    temp = []
    result = ''
    for char in text:
        value = chars[char] # KeyError = that char is missing from the char set
        if value * base ** len(temp) + power_sum(temp, base, 1) > 0x10FFFF: 
            # U+10FFFF is max unicode code point
            result += chr(power_sum(temp, base))
            temp = [value]
        else:
            temp.append(value)
    result += chr(power_sum(temp, base))
    return result
    
def revert_text(text, chars):
    base = len(chars) + 1
    chars = [None] + list(chars)
    result = ''
    for char in text:
        value = ord(char)
        while value:
            result += chars[int(value % base)]
            value //= base
    return result

chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars) + 1, end = '\n\n')

texts = [
    'this',
    'the quick brown fox jumped over the lazy dog',
    'china',
    'aaa',
    'aaaaa',
    'aaaaaaaa',
    'aaaab'
]

for text in texts:
    print('Start text ({}): {}'.format(len(text), text))
    
    text = convert_text(text, chars)
    print('Unicode text ({}): {}'.format(len(text), text))
    
    text = revert_text(text, chars)
    print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')

Output:

Base: 28

Start text (4): this
Unicode text (1): 񧧄
Revert text (4): this

Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (12): 򑮄𑼭񡂟򓢳䬪񉱃򑠜񡥇𜞋򋧻񑖍
Revert text (44): the quick brown fox jumped over the lazy dog

Start text (5): china
Unicode text (2): 񌳳
Revert text (5): china

Start text (3): aaa
Unicode text (1): ̭
Revert text (3): aaa

Start text (5): aaaaa
Unicode text (2): 壭
Revert text (5): aaaaa

Start text (8): aaaaaaaa
Unicode text (2): 壭壭
Revert text (8): aaaaaaaa

Start text (5): aaaab
Unicode text (2): 壭
Revert text (5): aaaab
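
Two caveats about the code above, with a possible variant. First, the overflow check multiplies the whole running sum by the base rather than adding only the new term, so it is conservative: it starts a new group after four characters even when a fifth would still fit (which is why aaaaa takes two characters). Second, chr() will happily return code points in the surrogate range U+D800 to U+DFFF (for example, xnnb packs to exactly 0xD800), and those cannot be encoded as UTF-8. The following is a sketch under my own assumptions, not a drop-in guarantee: it tests the exact candidate value, starts a new group when the candidate would be a surrogate, and assumes the character set is small (far fewer than 0xD800 characters). MAX_CODE_POINT and SURROGATES are names introduced here.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # chr() accepts these, UTF-8 encoding does not

def convert_text(text, chars):
    base = len(chars) + 1
    index_of = {char: index + 1 for index, char in enumerate(chars)}
    result = []
    total, weight = 0, 1          # running code point and base ** (digits so far)
    for char in text:
        value = index_of[char]    # KeyError: char is not in the character set
        candidate = total + value * weight
        if candidate > MAX_CODE_POINT or candidate in SURROGATES:
            result.append(chr(total))       # flush the current group
            total, weight = value, base     # start a new group with this digit
        else:
            total, weight = candidate, weight * base
    if weight > 1:                # flush the last group unless text was empty
        result.append(chr(total))
    return ''.join(result)

def revert_text(text, chars):
    base = len(chars) + 1
    lookup = [None] + list(chars)  # digit 0 is never produced by convert_text
    result = []
    for char in text:
        value = ord(char)
        while value:
            value, digit = divmod(value, base)
            result.append(lookup[digit])
    return ''.join(result)

chars = 'abcdefghijklmnopqrstuvwxyz '
print(len(convert_text('aaaaa', chars)))  # 1
print(revert_text(convert_text('xnnb', chars), chars))  # xnnb
```

Unlike the version above, this variant returns an empty string for empty input instead of chr(0).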
