VladoPortos

Reputation: 602

Generating and writing a file bigger than system RAM in Python

I know of this nice code that can generate files of specific sizes and write them to disk.

import os
import uuid

def file_generator(location, size):
    filename = str(uuid.uuid4())
    with open('{0}{1}'.format(location, filename), 'wb') as target:
        target.write(os.urandom(size))
    return filename

However, there is one small issue: it can't generate files bigger than system RAM; it fails with a MemoryError. Any idea how to write the file out as a stream, or some other way around this issue?

Upvotes: 0

Views: 1439

Answers (3)

CristiFati

Reputation: 41147

When dealing with this kind of problem, the solution is to break the data into chunks, choosing a chunk size that:

  • Is smaller than limits you have no control over (in this case, RAM size)
  • Is not so small that the process takes forever

In the example below, the wanted file size is split into 32 MiB chunks (resulting in a number (>= 0) of complete chunks, and possibly an incomplete final chunk).

code.py:

import sys
import os
import uuid


DEFAULT_CHUNK_SIZE = 33554432  # 32 MiB


def file_generator(location, size):
    filename = str(uuid.uuid4())
    with open('{0}{1}'.format(location, filename), 'wb') as target:
        target.write(os.urandom(size))
    return filename


def file_generator_chunked(location, size, chunk_size=DEFAULT_CHUNK_SIZE):
    file_name = str(uuid.uuid4())
    chunks = size // chunk_size  # Number of complete chunks
    last_chunk_size = size % chunk_size  # Remaining bytes (possibly 0)
    with open("{0}{1}".format(location, file_name), "wb") as target:
        for _ in range(chunks):
            target.write(os.urandom(chunk_size))
        if last_chunk_size:
            target.write(os.urandom(last_chunk_size))
    return file_name


def main():
    # Note the trailing slash: the target path is built by plain string concatenation
    file_name = file_generator_chunked("/tmp/", 100000000)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
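
With a wanted size of 100000000 bytes and the 32 MiB (33554432-byte) chunk size, main() above writes 2 complete chunks (67108864 bytes) followed by a final chunk of 32891136 bytes.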

Upvotes: 2

Thomas Weller

Reputation: 59575

os.urandom returns a bytes object (a str in Python 2) of the requested size. That object first needs to fit in memory. If the data were produced by a generator instead, things would work in a more memory-efficient way.

The limit is not physical RAM, however: it does not depend on the amount of RAM installed on your machine. It's bounded by virtual memory, which is roughly 8 TB for 64-bit programs on 64-bit Windows. Going beyond physical RAM may involve swapping to disk, though, which becomes slow.

Therefore, potential solutions are:

  1. switch from 32-bit Python to 64-bit Python; you won't need to change the program at all, but it will become significantly slower once you go beyond physical RAM.
  2. write the file in smaller parts, say 10 MB at a time.

In contrast to @quamrana's answer, I would not change the method signature. The caller could still choose one block of 8 GB, which has the same effect as before.

The following takes that burden from the caller:

import os
import uuid

def file_generator(location, size):
    filename = str(uuid.uuid4())
    chunksize = 10 * 1024 * 1024  # 10 MiB per write
    with open('{0}{1}'.format(location, filename), 'wb') as target:
        while size > chunksize:
            target.write(os.urandom(chunksize))
            size -= chunksize
        target.write(os.urandom(size))  # Write the remaining (<= chunksize) bytes
    return filename
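
For illustration, a minimal usage sketch (the /tmp/ target directory and the 50 MiB size are arbitrary choices, not part of the answer):

name = file_generator('/tmp/', 50 * 1024 * 1024)
print(os.path.getsize('/tmp/' + name))  # expected: 52428800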

Upvotes: 3

quamrana

Reputation: 39404

Write the file in blocks:

import os
import uuid

def large_file_generator(location, block_size, number_of_blocks):
    filename = str(uuid.uuid4())
    with open('{0}{1}'.format(location, filename), 'wb') as target:
        for _ in range(number_of_blocks):
            target.write(os.urandom(block_size))
    return filename
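
As a rough usage sketch (the 10 MiB block size and the ~1 GiB target are assumptions chosen for illustration), the caller picks a block size and computes the block count:

block_size = 10 * 1024 * 1024                 # 10 MiB per block
number_of_blocks = (1024 ** 3) // block_size  # 102 blocks, roughly 1 GiB in total
large_file_generator('/tmp/', block_size, number_of_blocks)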

Upvotes: 0
