1m7
1m7

Reputation: 11

How to compress text length using Unicode and bases, and be able to revert it back?

There are 44 characters in this text: the quick brown fox jumped over the lazy dog

The same text can be represented with just 11 Unicode characters: 񜥎񐟾𴬔񇒉𚫔𮹂𓻣񥯨񜥎𵁼񽤙
(These characters look the same as "[]" but they are all different characters!)

This is because the ASCII characters can range from 1-27 (base 27 if you only use the 27 characters in this character set abcdefghijklmnopqrstuvwxyz ) and Unicode characters range from 1-1114112, which means you can store multiple numbers in a bigger number if you do indices-related math.

For example, the text this looks like [19, 7, 8, 18] if you convert each character to their index in the above base 27 character set. If you do the calculation below:

19 x 27 ^ 0 +  
7  x 27 ^ 1 +  
8  x 27 ^ 2 +  
18 x 27 ^ 3 = 360334

You will get a unique number 360334, which happens to be within 1-1114112 so you can do chr(360334) to get the Unicode character 񗾎. To go back, you do ord('񗾎') to get 360334, which you can continuously divmod to get back the numbers shown below:

360334 %  27 = 19
360334 // 27 = 13345
13345  %  27 = 7
13345  // 27 = 494
494    %  27 = 8
494    // 27 = 18
18     %  27 = 18
18     // 27 = 0 BREAK

The question is: How to make this as a convert and revert function in Python?

Here is my attempt:

def power_sum(values, base, offset = 0):
    return sum(value * base ** (index + offset) for index, value in enumerate(values))

def convert_text(text, chars):
    base = len(chars)
    chars =  {char : index for index, char in enumerate(chars)}
    temp = []
    result = ''
    for index, char in enumerate(text):
        value = chars[char] # indexerror = missing that char in char set
        if power_sum(temp, base, 1) + value > 0x10FFFF: # U+10FFFF is max unicode code point
            result += chr(power_sum(temp, base))
            temp = [value]
        else:
            temp.append(value)
    result += chr(power_sum(temp, base))
    return result
    
def revert_text(text, chars):
    base = len(chars)
    chars = list(chars)
    result = ''
    for char in text:
        value = ord(char)
        while value:
            result += chars[int(value % base)]
            value //= base
    return result

chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars), end = '\n\n')

texts = [
    'this',
    'the quick brown fox jumped over the lazy dog',
    'china'
]

for text in texts:
    print('Start text ({}): {}'.format(len(text), text))
    
    text = convert_text(text, chars)
    print('Unicode text ({}): {}'.format(len(text), text))
    
    text = revert_text(text, chars)
    print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')

Output:

Base: 27

Start text (4): this
Unicode text (1): 񗾎
Revert text (4): this

Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (11): 񽭂늺񒂴񿙳򁈌񊖞񇻉񿿸񽭂񷲄🖛
Revert text (44): the quick brown fox jumped over the lazy dog

Start text (5): china
Unicode text (2): 𿼎
Revert text (4): chin

It fails with the string china for some reason.

Upvotes: 0

Views: 117

Answers (2)

1m7
1m7

Reputation: 11

Thanks to chrslg's answer for showing that the code stores any number of trailing index 0 characters as 0 after the indices math. I fixed the code by shifting all the characters in the charset by 1, essentially making the 27 char set = base 28.

def power_sum(values, base, offset = 0):
    return sum(value * base ** (index + offset) for index, value in enumerate(values))

def convert_text(text, chars):
    base = len(chars) + 1
    chars =  {char : index + 1 for index, char in enumerate(chars)}
    temp = []
    result = ''
    for char in text:
        value = chars[char] # indexerror = missing that char in char set
        if value * base ** len(temp) + power_sum(temp, base, 1) > 0x10FFFF: 
            # U+10FFFF is max unicode code point
            result += chr(power_sum(temp, base))
            temp = [value]
        else:
            temp.append(value)
    result += chr(power_sum(temp, base))
    return result
    
def revert_text(text, chars):
    base = len(chars) + 1
    chars = [None] + list(chars)
    result = ''
    for char in text:
        value = ord(char)
        while value:
            result += chars[int(value % base)]
            value //= base
    return result

chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars) + 1, end = '\n\n')

texts = [
    'this',
    'the quick brown fox jumped over the lazy dog',
    'china',
    'aaa',
    'aaaaa',
    'aaaaaaaa',
    'aaaab'
]

for text in texts:
    print('Start text ({}): {}'.format(len(text), text))
    
    text = convert_text(text, chars)
    print('Unicode text ({}): {}'.format(len(text), text))
    
    text = revert_text(text, chars)
    print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')

Output:

Base: 28

Start text (4): this
Unicode text (1): 񧧄
Revert text (4): this

Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (12): 򑮄𑼭񡂟򓢳䬪񉱃򑠜񡥇𜞋򋧻񑖍
Revert text (44): the quick brown fox jumped over the lazy dog

Start text (5): china
Unicode text (2): 񌳳
Revert text (5): china

Start text (3): aaa
Unicode text (1): ̭
Revert text (3): aaa

Start text (5): aaaaa
Unicode text (2): 壭
Revert text (5): aaaaa

Start text (8): aaaaaaaa
Unicode text (2): 壭壭
Revert text (8): aaaaaaaa

Start text (5): aaaab
Unicode text (2): 壭
Revert text (5): aaaab

Upvotes: 0

chrslg
chrslg

Reputation: 13491

Not commenting on the idea and principle of this "compression" (and there would be some comments to make. But that is not your question).

But just look at how you would encode string "aaa".

In base 27, it is 27²×0 + 27×0 + 0 so, 0, right?

But that is also code of "a". Or "aaaaaaaa".

Since you add more weight to the last character (somehow, you are "little endian" in your encoding), you are in the same situation as decimal numbers, but in the reverse order (decimal numbers are "big endians"). I mean by that that in decimal number initial 0 are meaning less. And that 001234 is the same thing as 01234 or 1234 (in everyday decimal. Not in C or python where it triggers octal (for C) or is a syntax error (for python)).

In your base, it is trailing 0 (or a) that are meaningless. So china is the same thing as chin or chinaaaaa.

Upvotes: 0

Related Questions