Devilhunter
Devilhunter

Reputation: 133

How can I densely store large numbers in a file?

I need to store and handle huge amounts of very long numbers, which are in range from 0 to f 64 times (ffffffffff.....ffff).

If I store these numbers in a file, I need 1 byte for each character (digit) + 2 bytes for \n symbol = up to 66 bytes. However to represent all possible numbers we need not more than 34 bytes (4 bits represent digits from 0 to f, therefore 4 [bits] * 64 [amount of hex digits]/8 [bits a in byte] = 32 bytes + \n, of course).

Is there any way to store the number without consuming excess memory?

So far I have created converter from hex (with 16 digits per symbol) to a number with base of 76 (hex + all letters and some other symbols), which reduces size of a number to 41 + 2 bytes.

Upvotes: 2

Views: 2343

Answers (3)

wecsam
wecsam

Reputation: 2761

If you anticipate storing an even distribution of numbers, then see Mad Physicist's answer. However, If you anticipate storing mostly small numbers but need to be able to store a few large numbers, then these schemes may also be useful.

If you only need to account for integers that are 255 or fewer bytes (2040 or fewer bits) in length, then simply convert the int to a bytes object and store the length in an additional byte, like this:

# This was only tested with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # Convert the number to a byte array and strip away leading null bytes.
    # You can also use byteorder="little" and rstrip.
    # If the integer does not fit into 255 bytes, an OverflowError will be raised.
    encoded = num.to_bytes(255, byteorder="big").lstrip(b'\0')
    # Return the length of the integer in the first byte, followed by the encoded integer.
    return bytes([len(encoded)]) + encoded
def encode_many(nums):
    return b''.join(encode(num) for num in nums)
def decode_many(byte_array):
    assert isinstance(byte_array, bytes)
    result = []
    start = 0
    while start < len(byte_array):
        # The first byte contains the length of the integer.
        int_length = byte_array[start]
        # Read int_length bytes and decode them as int.
        new_int = int.from_bytes(byte_array[(start+1):(start+int_length+1)], byteorder="big")
        # Add the new integer to the result list.
        result.append(new_int)
        start += int_length + 1
    return result

To store integers of (practically) infinite length, you can use this scheme, based on variable-length quantities in the MIDI file format. First, the rules:

  • A byte has eight bits (for those who don't know).
  • In each byte except the last, the left-most bit (the highest-order bit) will be 1.
  • The lower seven bits (i.e. all bits except the left-most bit) in each byte, when concatenated together, form an integer with a variable number of bits.

Here are a few examples:

  • 0 in binary is 00000000. It can be represented in one byte without modification as 00000000.
  • 127 in binary is 01111111. It can be represented in one byte without modification as 01111111.
  • 128 in binary is 10000000. It must be converted to a two-byte representation: 10000001 00000000. Let's break that down:
    • The left-most bit in the first byte is 1, which means that it is not the last byte.
    • The left-most bit in the second byte is 0, which means that it is the last byte.
    • The lower seven bits in the first byte are 0000001, and the lower seven bits in the second byte are 0000000. Concatenate those together, and you get 00000010000000, which is 128.
  • 173249806138790 in binary is 100111011001000111011101001001101111110110100110.
    • To store it:
      • First, split the binary number into groups of seven bits: 0100111 0110010 0011101 1101001 0011011 1111011 0100110 (a leading 0 was added)
      • Then, add a 1 in front of each byte except the last, which gets a 0: 10100111 10110010 10011101 11101001 10011011 11111011 00100110
    • To retrieve it:
      • First, drop the first bit of each byte: 0100111 0110010 0011101 1101001 0011011 1111011 0100110
      • You are left with an array of seven-bit segments. Join them together: 100111011001000111011101001001101111110110100110
      • When that is converted to decimal, you get 173,249,806,138,790.

Why, you ask, do we make the left-most bit in the last byte of each number a 0? Well, doing that allows you to concatenate multiple numbers together without using line breaks. When writing the numbers to a file, just write them one after another. When reading the numbers from a file, use a loop that builds an array of integers, ending each integer whenever it detects a byte where the left-most bit is 0.

Here are two functions, encode and decode, which convert between int and bytes in Python 3.

# Important! These methods only work with non-negative integers!
def encode(num):
    assert isinstance(num, int)
    # If the number is 0, then just return a single null byte.
    if num <= 0:
        return b'\0'
    # Otherwise...
    result_bytes_reversed = []
    while num > 0:
        # Find the right-most seven bits in the integer.
        current_seven_bit_segment = num & 0b1111111
        # Change the left-most bit to a 1.
        current_seven_bit_segment |= 0b10000000
        # Add that to the result array.
        result_bytes_reversed.append(current_seven_bit_segment)
        # Chop off the right-most seven bits.
        num = num >> 7
    # Change the left-most bit in the lowest-order byte (which is first in the list) back to a 0.
    result_bytes_reversed[0] &= 0b1111111
    # Un-reverse the order of the bytes and convert the list into a byte string.
    return bytes(reversed(result_bytes_reversed))
def decode(byte_array):
    assert isinstance(byte_array, bytes)
    result = 0
    for part in byte_array:
        # Shift the result over by seven bits.
        result = result << 7
        # Add in the right-most seven bits from this part.
        result |= (part & 0b1111111)
    return result

Here are two functions for working with lists of ints:

def encode_many(nums):
    return [encode(num) for num in nums]
def decode_many(byte_array):
    parts = []
    # Split the byte array after each byte where the left-most bit is 0.
    start = 0
    for i, b in enumerate(byte_array):
        # Check whether the left-most bit in this byte is 0.
        if not (b & 0b10000000):
            # Copy everything up to here into a new part.
            parts.append(byte_array[start:(i+1)])
            start = i + 1
    return [decode(part) for part in parts]

Upvotes: 4

Mad Physicist
Mad Physicist

Reputation: 114350

You are trying to store 32 bytes long. Why not just store them as binary numbers? That way you need to store only 32 bytes per number instead of 41 or whatever. You can add on all sorts of quasi-compression schemes to take advantage of things like most of your numbers being shorter than 32 bytes.

If your number is a string, convert it to an int first. Python3 ints are basically infinite precision, so you will not lose any information:

>>> num = '113AB87C877AAE3790'
>>> num = int(num, 16)
>>> num
317825918024297625488

Now you can convert the result to a byte array and write it to a file opened for binary writing:

with open('output.bin', 'wb') as file:
    file.write(num.to_bytes(32, byteorder='big'))

The int method to_bytes converts your number to a string of bytes that can be placed in a file. You need to specify the string length and the order. 'big' makes it easier to read a hex dump of the file.

To read the file back and decode it using int.from_bytes in a similar manner:

with open('output.bin', 'rb') as file:
    bytes = file.read(32)
    num = int.from_bytes(bytes, byteorder='big')

Remember to always include the b in the file mode, or you may run into unexpected problems if you try to read or write data with codes for \n in it.

Both the read and write operation can be looped as a matter of course.

Upvotes: 7

Artyer
Artyer

Reputation: 40836

The densest possible way without knowing more about the numbers would be 256 bits per number (32 bytes).

You can store them right after one another.

A function to write to a file might look like this:

def write_numbers(numbers, file):
    for n in numbers:
        file.write(n.to_bytes(32, 'big'))

with open('file_name', 'wb') as f:
    write_numbers(get_numbers(), f)

And to read the numbers, you can make a function like this:

def read_numbers(file):
    while True:
        read = file.read(32)
        if not read:
            break
        yield int.from_bytes(read, 'big')

with open('file_name', 'rb') as f:
    for n in read_numbers(f):
        do_stuff(n)

Upvotes: 2

Related Questions