llrs

Reputation: 3397

Reading one big file vs open thousands of files

I have 20333 files that together add up to 93 MB; each one can weigh between 136 B and 956 KB. I need to read data from these tab-separated files (*.tsv).

I am considering appending them into one file (to avoid opening and closing thousands of files) while I download them from an FTP server (sketched below).
To open and read any file I use the following function:

def read_file(file_):
    # Generator: yield each line of the file split on tabs
    with open(file_) as f:
        for line in f:
            yield line.split("\t")

Would this be a good idea to improve performance?
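For reference, this is roughly how I plan to do the appending during the download (only a sketch, untested; the host, login and remote directory are placeholders, not the real server):

from ftplib import FTP

HOST = "ftp.example.com"        # placeholder, not the real server
REMOTE_DIR = "/remote/tsv/dir"  # placeholder
COMBINED = "all_files.tsv"      # single local file to append into

ftp = FTP(HOST)
ftp.login()                     # anonymous login; pass user/password if needed
ftp.cwd(REMOTE_DIR)

with open(COMBINED, "wb") as out:
    for name in ftp.nlst():     # list the remote files
        if name.endswith(".tsv"):
            # retrbinary streams each file in chunks straight into the combined file
            ftp.retrbinary("RETR " + name, out.write)

ftp.quit()

One thing I would need to watch for is a file that does not end with a newline, since its last line would then run into the first line of the next file.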

Upvotes: 2

Views: 338

Answers (2)

Charles Duffy

Reputation: 295500

Yes, concatenating the contents into a single file would improve performance -- if for no other reason than that it allows the transfer to be pipelined.

Retrieving a series of files requires a significant number of request/response pairs; while the server is waiting for a new command from the client, bandwidth which could otherwise be used is wasted, unless one adds significant complexity and logic to avoid this (running multiple concurrent FTP connections, for instance).

By contrast, retrieving a large file allows the server to continually send content until it loses ACKs from the client (telling it to slow down). This will result in significantly better throughput.
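As a rough illustration (a sketch only -- the host, the file names and a pre-concatenated all.tsv on the server are assumptions, not something given in the question), you can time both strategies over the same connection:

import time
from ftplib import FTP

HOST = "ftp.example.com"             # assumption: replace with the real server
FILES = ["a.tsv", "b.tsv", "c.tsv"]  # assumption: a handful of the ~20k files
SINGLE = "all.tsv"                   # assumption: one pre-concatenated remote file

ftp = FTP(HOST)
ftp.login()

# Many small transfers: one RETR command/reply round trip per file,
# so the link sits idle between files.
start = time.time()
for name in FILES:
    data = bytearray()
    ftp.retrbinary("RETR " + name, data.extend)
print("per-file retrieval:", time.time() - start)

# One large transfer: a single RETR, so the server can stream continuously.
start = time.time()
data = bytearray()
ftp.retrbinary("RETR " + SINGLE, data.extend)
print("single-file retrieval:", time.time() - start)

ftp.quit()

With thousands of tiny files, the per-file version spends most of its time on command round trips rather than on the data itself.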

Upvotes: 1

Paddy D

Reputation: 1

I think the others have answered your question in terms of efficiency. I just want to add the following:

To open and read all the files in a directory/folder, you could use the code below. Hope this is of some help.

import glob

# Single output file that will hold the contents of every input file
output = r"path\to\file\i\want\to\write\too\output.txt"
with open(output, 'w') as outfile:
    # *.tsv matches the question's tab-separated files; adjust the folder path as needed
    for file_name in glob.glob("/path/to/folder/containing/files/*.tsv"):
        with open(file_name) as infile:
            outfile.write(infile.read())

Upvotes: 0
