Reputation: 11
There are 44 characters in this text: the quick brown fox jumped over the lazy dog
The same text can be represented with just 11 Unicode characters:
(These characters look the same as "[]" but they are all different characters!)
This is because the ASCII characters can range from 1-27 (base 27 if you only use the 27 characters in this character set abcdefghijklmnopqrstuvwxyz
) and Unicode characters range from 1-1114112, which means you can store multiple numbers in a bigger number if you do indices-related math.
For example, the text this
looks like [19, 7, 8, 18]
if you convert each character to their index in the above base 27 character set. If you do the calculation below:
19 x 27 ^ 0 +
7 x 27 ^ 1 +
8 x 27 ^ 2 +
18 x 27 ^ 3 = 360334
You will get a unique number 360334
, which happens to be within 1-1114112 so you can do chr(360334)
to get the Unicode character
. To go back, you do ord('')
to get 360334
, which you can continuously divmod to get back the numbers shown below:
360334 % 27 = 19
360334 // 27 = 13345
13345 % 27 = 7
13345 // 27 = 494
494 % 27 = 8
494 // 27 = 18
18 % 27 = 18
18 // 27 = 0 BREAK
The question is: How to make this as a convert and revert function in Python?
Here is my attempt:
def power_sum(values, base, offset = 0):
return sum(value * base ** (index + offset) for index, value in enumerate(values))
def convert_text(text, chars):
base = len(chars)
chars = {char : index for index, char in enumerate(chars)}
temp = []
result = ''
for index, char in enumerate(text):
value = chars[char] # indexerror = missing that char in char set
if power_sum(temp, base, 1) + value > 0x10FFFF: # U+10FFFF is max unicode code point
result += chr(power_sum(temp, base))
temp = [value]
else:
temp.append(value)
result += chr(power_sum(temp, base))
return result
def revert_text(text, chars):
base = len(chars)
chars = list(chars)
result = ''
for char in text:
value = ord(char)
while value:
result += chars[int(value % base)]
value //= base
return result
chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars), end = '\n\n')
texts = [
'this',
'the quick brown fox jumped over the lazy dog',
'china'
]
for text in texts:
print('Start text ({}): {}'.format(len(text), text))
text = convert_text(text, chars)
print('Unicode text ({}): {}'.format(len(text), text))
text = revert_text(text, chars)
print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')
Output:
Base: 27
Start text (4): this
Unicode text (1):
Revert text (4): this
Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (11): 늺🖛
Revert text (44): the quick brown fox jumped over the lazy dog
Start text (5): china
Unicode text (2):
Revert text (4): chin
It fails with the string china
for some reason.
Upvotes: 0
Views: 117
Reputation: 11
Thanks to chrslg's answer for showing that the code stores any number of trailing index 0 characters as 0
after the indices math. I fixed the code by shifting all the characters in the charset by 1, essentially making the 27 char set = base 28.
def power_sum(values, base, offset = 0):
return sum(value * base ** (index + offset) for index, value in enumerate(values))
def convert_text(text, chars):
base = len(chars) + 1
chars = {char : index + 1 for index, char in enumerate(chars)}
temp = []
result = ''
for char in text:
value = chars[char] # indexerror = missing that char in char set
if value * base ** len(temp) + power_sum(temp, base, 1) > 0x10FFFF:
# U+10FFFF is max unicode code point
result += chr(power_sum(temp, base))
temp = [value]
else:
temp.append(value)
result += chr(power_sum(temp, base))
return result
def revert_text(text, chars):
base = len(chars) + 1
chars = [None] + list(chars)
result = ''
for char in text:
value = ord(char)
while value:
result += chars[int(value % base)]
value //= base
return result
chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars) + 1, end = '\n\n')
texts = [
'this',
'the quick brown fox jumped over the lazy dog',
'china',
'aaa',
'aaaaa',
'aaaaaaaa',
'aaaab'
]
for text in texts:
print('Start text ({}): {}'.format(len(text), text))
text = convert_text(text, chars)
print('Unicode text ({}): {}'.format(len(text), text))
text = revert_text(text, chars)
print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')
Output:
Base: 28
Start text (4): this
Unicode text (1):
Revert text (4): this
Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (12): 𑼭䬪
Revert text (44): the quick brown fox jumped over the lazy dog
Start text (5): china
Unicode text (2):
Revert text (5): china
Start text (3): aaa
Unicode text (1): ̭
Revert text (3): aaa
Start text (5): aaaaa
Unicode text (2): 壭
Revert text (5): aaaaa
Start text (8): aaaaaaaa
Unicode text (2): 壭壭
Revert text (8): aaaaaaaa
Start text (5): aaaab
Unicode text (2): 壭
Revert text (5): aaaab
Upvotes: 0
Reputation: 13491
Not commenting on the idea and principle of this "compression" (and there would be some comments to make. But that is not your question).
But just look at how you would encode string "aaa".
In base 27, it is 27²×0 + 27×0 + 0
so, 0, right?
But that is also code of "a". Or "aaaaaaaa".
Since you add more weight to the last character (somehow, you are "little endian" in your encoding), you are in the same situation as decimal numbers, but in the reverse order (decimal numbers are "big endians"). I mean by that that in decimal number initial 0 are meaning less. And that 001234 is the same thing as 01234 or 1234 (in everyday decimal. Not in C or python where it triggers octal (for C) or is a syntax error (for python)).
In your base, it is trailing 0 (or a
) that are meaningless. So china
is the same thing as chin
or chinaaaaa
.
Upvotes: 0