llrs

Reputation: 3397

Reading one big file vs open thousands of files

I have 20333 files that together add up to 93 MB; each one can weigh between 136 B and 956 KB. I need to read data from these tab-separated files (*.tsv).

I am considering appending them into one file (to avoid opening and closing thousands of files) while I download them from an FTP server (sketched below).
To open and read any file I use the following function:

def read_file(file_):
    # Generator: yield each line of the file split on tabs
    with open(file_) as f:
        for line in f:
            yield line.split("\t")

Would this be a good idea to improve performance?
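For reference, this is roughly how I plan to do the appending during the download (only a sketch, untested; the host, login and remote directory are placeholders, not the real server):

from ftplib import FTP

HOST = "ftp.example.com"        # placeholder, not the real server
REMOTE_DIR = "/remote/tsv/dir"  # placeholder
COMBINED = "all_files.tsv"      # single local file to append into

ftp = FTP(HOST)
ftp.login()                     # anonymous login; pass user/password if needed
ftp.cwd(REMOTE_DIR)

with open(COMBINED, "wb") as out:
    for name in ftp.nlst():     # list the remote files
        if name.endswith(".tsv"):
            # retrbinary streams each file in chunks straight into the combined file
            ftp.retrbinary("RETR " + name, out.write)

ftp.quit()

One thing I would need to watch for is a file that does not end with a newline, since its last line would then run into the first line of the next file.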

Upvotes: 2

Views: 338

Answers (2)

Charles Duffy

Reputation: 295500

Yes, concatenating the contents into a single file would improve performance -- if for no other reason than that it allows the transfer to be pipelined.

Retrieving a series of files requires a significant number of request/response pairs; while the server is waiting for a new command from the client, bandwidth which could otherwise be used is wasted, unless one adds significant complexity and logic to avoid this (running multiple concurrent FTP connections, for instance).

By contrast, retrieving a large file allows the server to continually send content until it loses ACKs from the client (telling it to slow down). This will result in significantly better throughput.
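As a rough illustration (a sketch only -- the host, the file names and a pre-concatenated all.tsv on the server are assumptions, not something given in the question), you can time both strategies over the same connection:

import time
from ftplib import FTP

HOST = "ftp.example.com"             # assumption: replace with the real server
FILES = ["a.tsv", "b.tsv", "c.tsv"]  # assumption: a handful of the ~20k files
SINGLE = "all.tsv"                   # assumption: one pre-concatenated remote file

ftp = FTP(HOST)
ftp.login()

# Many small transfers: one RETR command/reply round trip per file,
# so the link sits idle between files.
start = time.time()
for name in FILES:
    data = bytearray()
    ftp.retrbinary("RETR " + name, data.extend)
print("per-file retrieval:", time.time() - start)

# One large transfer: a single RETR, so the server can stream continuously.
start = time.time()
data = bytearray()
ftp.retrbinary("RETR " + SINGLE, data.extend)
print("single-file retrieval:", time.time() - start)

ftp.quit()

With thousands of tiny files, the per-file version spends most of its time on command round trips rather than on the data itself.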

Upvotes: 1

Paddy D

Reputation: 1

I think the others have answered your question in terms of efficiency. I just want to add the following:

To open and read all the files in a directory/folder, you could use the code below. Hope this is of some help.

import glob

# Single output file that will hold the contents of every input file
output = r"path\to\file\i\want\to\write\too\output.txt"
with open(output, 'w') as outfile:
    # *.tsv matches the question's tab-separated files; adjust the folder path as needed
    for file_name in glob.glob("/path/to/folder/containing/files/*.tsv"):
        with open(file_name) as infile:
            outfile.write(infile.read())

Upvotes: 0
