Ali Yılmaz

Reputation: 1695

Writing to files: switch to a new file once the current one reaches X MB

I have millions of domains to which I will send WHOIS queries, recording each WHOIS response in a .txt file.

I would like to set a maximum size for a single .txt output file. For example, say I start recording responses in out0.txt. I want to switch to out1.txt once out0.txt is >= 100 MB. The same goes for out1.txt: if out1.txt >= 100 MB, start writing to out2.txt, and so on.

I know that I can do an if check after each write, but I want my code to be fast, and I thought that an if check for every domain could slow it down. (It will query millions of domains asynchronously.)

I imagined a try-except block could solve my issue here, like this:

folder_name = "out%s.txt"
folder_number = 0

# Keep the pattern intact; format it only when opening a file
f = open(folder_name % folder_number, 'w+')

for domain in millions_of_domains:
   try:
      response_json = send_whois_query(domain)
      f.write(response_json)
   except FileGreaterThan100MbException:  # hypothetical; Python has no such built-in exception
      f.close()
      folder_number += 1
      f = open(folder_name % folder_number, 'w+')
      f.write(response_json)

Any suggestions will be appreciated. Thank you for your time.

Upvotes: 0

Views: 35

Answers (1)

Martijn Pieters

Reputation: 1121406

You can create a wrapper object that tracks how much data has been written, and opens a new file once a limit is reached:

class MaxSizeFileWriter(object):
    def __init__(self, filenamepattern, maxdata=2**20,  # default 1 MiB
                 start=0, mode='w', *args, **kwargs):
        self._pattern = filenamepattern
        self._counter = start

        self._mode = mode
        self._args, self._kwargs = args, kwargs

        self._max = maxdata

        self._openfile = None
        self._written = 0

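    def _open(self):
        # Only open a new file when none is currently open
        if self._openfile is None:
            filename = self._pattern.format(self._counter)
            self._counter += 1
            self._openfile = open(filename, self._mode, *self._args, **self._kwargs)

    def _close(self):
        if self._openfile is not None:
            self._openfile.close()
            # Reset state so the next write() opens a fresh file
            self._openfile = None
            self._written = 0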

    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        if self._openfile is not None:
            self._openfile.close()

    def write(self, data):
        if self._written + len(data) > self._max:
            # current file too full to fit data too, close it
            # This will trigger a new file to be opened.
            self._close()
        self._open()  # noop if already open
        self._openfile.write(data)
        self._written += len(data)

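The above is a context manager, and its write() method can be used just like that of a regular file. Note that len(data) counts characters when the file is opened in text mode, so the limit is only approximate in bytes for non-ASCII text. Pass in a filename pattern with a {} placeholder where the file number should be inserted: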

folder_name = "out{}.txt"

with MaxSizeFileWriter(folder_name, maxdata=100 * 2**20) as f:  # 100 MiB per file
    for domain in millions_of_domains:
        response_json = send_whois_query(domain)
        f.write(response_json)
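
If the limit has to be exact in bytes rather than characters, one option is to encode the text first and write in binary mode. The following is only a minimal sketch of that idea, not part of the class above: the names ByteLimitedWriter, max_bytes and demo are made up for illustration, and UTF-8 is an assumed encoding. The demo uses a tiny 10-byte limit in a temporary directory so the rotation is easy to see.

```python
import os
import tempfile


class ByteLimitedWriter:
    """Hypothetical variant: rotates output files based on encoded byte size."""

    def __init__(self, pattern, max_bytes, encoding="utf-8"):
        self._pattern = pattern      # e.g. "out{}.txt"
        self._max = max_bytes
        self._encoding = encoding    # assumed encoding for this sketch
        self._counter = 0
        self._file = None
        self._written = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._close()

    def _close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
            self._written = 0

    def write(self, text):
        payload = text.encode(self._encoding)
        # Rotate before writing if this payload would push the file past the limit
        if self._file is not None and self._written + len(payload) > self._max:
            self._close()
        if self._file is None:
            self._file = open(self._pattern.format(self._counter), "wb")
            self._counter += 1
        self._file.write(payload)
        self._written += len(payload)


def demo():
    """Write five 4-byte records with a 10-byte limit; return the file sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        pattern = os.path.join(tmp, "out{}.txt")
        with ByteLimitedWriter(pattern, max_bytes=10) as f:
            for _ in range(5):
                f.write("abc\n")
        return [os.path.getsize(os.path.join(tmp, name))
                for name in sorted(os.listdir(tmp))]
```

The size check here is one integer addition and one comparison per write, which should be negligible next to a network round trip for each WHOIS query. A single record larger than the limit still goes into a file of its own, so that file can exceed max_bytes.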

Upvotes: 1
