jlmr
jlmr

Reputation: 165

Compress a bitstring using Ruby

For a Bioinformatics project I need to compress a large amount of bitstrings (strings containing only 0's and 1's) in Ruby to smaller strings to reduce memory usage.

So ideally a string like "0001010010010010010001001" becomes something like "2a452c66". I first used MD5 hashes until I read something about possible collisions which I would like to avoid.

I have tried a lot of different combinations of unpack, to_i, to_s, etc, but can't seem to get the right combination.

The solution should:

Thanks!

Upvotes: 3

Views: 690

Answers (2)

George Powell
George Powell

Reputation: 9509

Just an interesting observation: If you want to convert a base-2 string into a higher base (say base-n), the compression ratio is 1/log2(n). This means that if you convert to hex as the other answer suggests, you'll get compression of 25% of the original. Moving all the way up to base 64 (only 2 more symbols than pure alphanumeric) you'll jump to about 17% compression. It just depends where you want to sit on that tradeoff!

Alternatively, if you could get rid of the reversibility requirement, only preserving equality, MD5 would be fine. See How many random elements before MD5 produces collisions? spoiler: it's a lot. The collision "problems" you will read about are purposeful collisions; cryptographers using their knowledge of MD5 to find collisions. Accidental collisions are impossible for all practical purposes.

Update

In terms of actually implementing a base64 encode in Ruby, I don't know. I don't actually know Ruby. What I'd do if it's not natively supported is make an array of all the alphanumeric characters + 2 (so the array is 64 long) and then convert chunks of 6 binary digits into the corresponding character by using that 6 digit binary number as an index in the character array. If you wanted to use 62 (or any other non-power-of-two) then the algorithm would be different.

Upvotes: 3

the Tin Man
the Tin Man

Reputation: 160581

Try:

FORMAT = '%0.*b'

bitmask = "0001010010010010010001001"
bitmask.to_i(2) # => 2696329
hexval = bitmask.to_i(2).to_s(16) # => "292489"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "0001010010010010010001001"

What it's doing is:

  • to_i(2) converts the string from binary to its integer value just to show what's happening.
  • to_i(2).to_s(16) converts it to its hexadecimal representation as a string.
  • FORMAT contains a printf format string saying to convert the passed in value into its binary string representation (%b) with leading 0 bytes (%0b) for an unknown length (%0.*b) which it gets from the first parameter passed (bitmask.size).

Here's another example using a longer bitmask:

bitmask = "11011110101011011011111011101111"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeef"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "11011110101011011011111011101111"

And longer still:

bitmask = "1101111010101101101111101110111111111110111011011010110111011010"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeeffeedadda"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "1101111010101101101111101110111111111110111011011010110111011010"

Upvotes: 4

Related Questions