Jon Wexler
Jon Wexler

Reputation: 161

Compress a number as a string to fit in 256 char space

I'm trying to use a bitmask to provide as many binary values as possible so that the final value will store in the limited allocate memory for a string. My current methodology is to find a maximum number and convert it to a string base-36.

value = (0 | (1<<1318)).to_s(36)

The result is 255 chars of a compressed number from which I can extract my original number of 1318. The downside is I'm limited to 1,318 binary values and I want to expand that number. Are there any alternative strategies in Ruby to compress this number even further?

Upvotes: 1

Views: 289

Answers (2)

Cary Swoveland
Cary Swoveland

Reputation: 110725

Numbers are non-negative

If the numbers are non-negative we can encode each 8-bits of each number to a character that is part of a string, and then decode the string by converting each character to 8 bits of the number.

def encode(n)
  str = ''
  until n.zero?
    str << (n & 255).chr
    n = n >> 8
  end
  str.reverse
end

def decode(str)     
  str.each_char.reduce(0) { |n,c| (n << 8) | c.ord }
end

This uses the following bit-manipulation methods in the class Integer: &, >>, << and |.

def test(n)
  encoded = encode(n)
  puts "#{n} => #{encoded} => #{decode(encoded)}"
end

test      1  #      1 => ["\u0001"]            =>      1
test     63  #     63 => ["?"]                 =>     63
test     64  #     64 => ["@"]                 =>     64
test    255  #    255 => ["\xFF"]              =>    255
test    256  #    256 => ["\u0001", "\u0000"]  =>    256
test 123456  # 123456 => ["\x01", "\xE2", "@"] => 123456

For example,

n = 123456
n.to_s(2)
  #=> "11110001001000000"

so

n = 0b11110001001000000
  #=> 123456

The bytes of this number can be visualized so:

00000001 11100010 01000000

We see that

a = [0b00000001, 0b11100010, 0b01000000]
a.map(&:chr)
  #=> ["\x01", "\xE2", "@"]

Numbers can be negative

If the numbers to be encoded can be negative we need to first convert then to their absolute values then add some information to the encoded string that indicates whether they are non-negative or negative. That will require at least one additional byte so we might include a "+" for non-negative numbers and a "-" for negative numbers.

def encode(n)
  sign = "+"
  if n < 0
    sign = "-"
    n = -n
  end
  str = ''
  until n.zero?
    str << (n & 255).chr
    n = n >> 8
  end
  (str << sign).reverse
end

def decode(str)
  n = str[1..-1].each_char.reduce(0) { |n,c| (n << 8) | c.ord }
  str[0] == '+' ? n : -n
end

test    -255  # -255    => ["-", "\xFF"]              => -255
test    -256  # -256    => ["-", "\u0001", "\u0000"]  => -256
test -123456  # -123456 => ["-", "\x01", "\xE2", "@"] => -123456
test  123456  #  123456 => ["+", "\x01", "\xE2", "@"] =>  123456

Upvotes: 1

Robindar
Robindar

Reputation: 484

You can always encode your number into base s and then represent that as string with whatever alphabet you want.

def encode(n, alphabet)
  s = alphabet.size
  res = []
  while (n > 0)
    res << n % s
    n = n / s
  end
  res.reverse.map { |i| alphabet[i] }.join
end

Your method is then equivalent to encode(n, alphabet), where alphabet is defined as

alphabet = ((0..9).to_a + ("a".."z").to_a).join
# => "0123456789abcdefghijklmnopqrstuvwxyz"

But you might as well use all possible characters instead of only 36 of them:

extended_alphabet = (0..255).map { |i| i.chr }.join

This gives a total of (256 ** 255) possibilities, i.e. up to (2 ** 2040), which is much better than your actual (2 ** 1318).

This encoding happens to be optimal because each character of your string can have at most 256 different values, and all of them are used here.


Decoding can then be performed as follows:

def decode(encoded, alphabet)
  s = alphabet.size
  n = 0
  decode_dict = {}; i = -1
  alphabet.each_char { |c| decode_dict[c] = (i += 1) }
  encoded.each_char do |c|
    n = n * s + decode_dict[c]
  end
  n
end

If you are going to use a fixed alphabet for all your encodings, I would suggest computing the decoding dictionnary outside of the function and taking it as a parameter instead of alphabet, to avoid computing it every time you try to encode a number.

Upvotes: 1

Related Questions