amza

Reputation: 810

Writing bits as bits to a file

So file systems deal with bytes but I'm looking to read/write data to a file in bits.

I have a file that is ~850 MB and the goal is to get it under 100 MB. I used delta + Huffman encoding to generate a "code table" of binary strings. Adding up all the "bits" (i.e. the total number of 0s and 1s in the encoded output) gives about 781,000,000 bits, so theoretically I should be able to store them in roughly 93 MiB (781,000,000 / 8 ≈ 97.6 million bytes). This is where I'm running into a problem.

Based on other answers I've seen around SO, this is the closest I've gotten:

import struct

with open(r'encoded_file.bin', 'wb') as f:
    for val in filedict:
        int_val = int(val[::-1], base=2)
        bin_array = struct.pack('i', int_val)
        f.write(bin_array)

The val passed along on each iteration is the binary string to be written. These do not have a fixed length and range from 10 for the most common to 111011001111001100 for the longest. The average code length is 5 bits. The above code generates a file of about 600 MB, still way off the target.
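A likely explanation for the 600 MB figure, as a quick sketch: struct.pack('i', ...) always emits a fixed-width integer (4 bytes on common platforms), so a 2-bit code costs as much on disk as an 18-bit one. At 4 bytes per code, 600 MB is about 150 million codes, which is consistent with 781,000,000 bits at roughly 5 bits per code. The variable names codes and sizes below are only for illustration.

```python
import struct

# The shortest and longest codes quoted above.
codes = ["10", "111011001111001100"]

# Pack each code the same way the loop above does and record the size.
sizes = [len(struct.pack("i", int(code[::-1], base=2))) for code in codes]

for code, size in zip(codes, sizes):
    # Every code occupies one full native int, whatever its bit length.
    print("{0} bits -> {1} bytes on disk".format(len(code), size))

# Native 'i' is platform-dependent in principle; the standard-size
# form '<i' is always 4 bytes.
print("native: {0}, standard: {1}".format(
    struct.calcsize("i"), struct.calcsize("<i")))
```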

Currently I am using Python 2.7; I can move to Python 3.x if I absolutely have to. Is this even possible in Python? Would a language like C or C++ make it easier?

Upvotes: 2

Views: 361

Answers (1)

Tadhg McDonald-Jensen

Reputation: 21453

Note: because bytes is just an alias for str in Python 2, I wasn't able to find a (decent) way of writing the following that works in both versions without using if USING_VS_3.

As a minimal interface to go from a string of bits to bytes that can be written to a file, you can use something like this:

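import sys

# Assumed definition: the flag is not defined in the original snippet.
# True on Python 3, where bytes is a distinct type from str.
USING_VS_3 = sys.version_info[0] >= 3

def _gen_parts(bits):
    # Expects len(bits) to be a multiple of 8.
    for start in range(0, len(bits), 8):
        b = int(bits[start:start+8], base=2)
        if USING_VS_3:
            yield b  # bytes() accepts an iterable of ints
        else:
            yield chr(b)

def bits_to_bytes(bits):  # -> (bytes, "leftover")
    # Index of the last whole-byte boundary. This is also correct when
    # len(bits) is a multiple of 8: everything is converted, nothing is left.
    split_i = len(bits) - len(bits) % 8
    byte_gen = _gen_parts(bits[:split_i])
    if USING_VS_3:
        whole = bytes(byte_gen)
    else:
        whole = "".join(byte_gen)
    return whole, bits[split_i:]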

So giving a string of binary data like '111011001111001100' to bits_to_bytes will return a 2-item tuple of (byte data to write to file) and (leftover bits).
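To make that concrete, here is a small self-contained sketch of the same split applied to the 18-bit code from the question. It uses bytearray (available in Python 2.6+ and 3) in place of the USING_VS_3 branch; that is a swapped-in simplification, not the code from this answer.

```python
def bits_to_bytes(bits):
    """Split a '0'/'1' string into (whole bytes, leftover bits)."""
    # Index of the last whole-byte boundary.
    split_i = len(bits) - len(bits) % 8
    # bytearray accepts an iterable of ints on both Python 2 and 3.
    whole = bytearray(int(bits[i:i + 8], 2) for i in range(0, split_i, 8))
    return bytes(whole), bits[split_i:]

data, leftover = bits_to_bytes('111011001111001100')
# 11101100 -> 0xEC, 11110011 -> 0xF3, and the last two bits are left over.
print(repr(data), repr(leftover))
```

The 18 bits become two bytes to write now, and the remaining '00' has to wait until enough further bits arrive to complete another byte.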

Then a simple and un-optimized file interface to handle the partial-byte-buffer could be like this:

class Bit_writer:
    def __init__(self, file):
        self.file = file
        self.buffer = ""  # bits received but not yet written

    def write(self, bits):
        byte_data, self.buffer = bits_to_bytes(self.buffer + bits)
        self.file.write(byte_data)

    def close(self):
        # Pad the final partial byte with zero bits on the right;
        # you may want to handle the padding differently.
        if self.buffer:
            pad = -len(self.buffer) % 8
            byte_data, _ = bits_to_bytes(self.buffer + "0" * pad)
            self.file.write(byte_data)
            self.buffer = ""
        self.file.close()

    def __enter__(self):  # This will let you use a 'with' block
        return self

    def __exit__(self, *unused):
        self.close()  # flush leftover bits before closing the file
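On the padding question: zero-padding the last byte means a reader cannot tell pad bits from real ones. One possible approach, sketched here under assumptions, is to append a final byte that records how many pad bits were added. The names write_codes and read_bits and the trailing pad-count layout are hypothetical, chosen only for illustration, and the sketch uses bytearray so that it runs on both Python 2.7 and 3.

```python
import io

def write_codes(codes, fileobj):
    """Write variable-length '0'/'1' codes packed 8 bits per byte.

    Assumed layout: the very last byte holds the number of zero pad
    bits that were appended to the final data byte.
    """
    buffer = ""
    for code in codes:
        buffer += code
        whole = len(buffer) - len(buffer) % 8
        if whole:
            fileobj.write(bytearray(
                int(buffer[i:i + 8], 2) for i in range(0, whole, 8)))
            buffer = buffer[whole:]
    pad = -len(buffer) % 8  # 0 when nothing is left over
    if buffer:
        fileobj.write(bytearray([int(buffer + "0" * pad, 2)]))
    fileobj.write(bytearray([pad]))

def read_bits(data):
    """Inverse of write_codes: return the original bit string."""
    body = bytearray(data[:-1])
    pad = bytearray(data[-1:])[0]
    bits = "".join("{0:08b}".format(b) for b in body)
    return bits[:len(bits) - pad]

# Round trip through an in-memory file.
codes = ['10', '111011001111001100', '10']
buf = io.BytesIO()
write_codes(codes, buf)
raw = buf.getvalue()
print(len(raw), read_bits(raw) == ''.join(codes))
```

The 22 input bits take three data bytes plus the one pad-count byte. For a file the size described in the question, read_bits would need to work in chunks rather than build one huge string; this version favors clarity over speed.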

Upvotes: 2
