Reputation: 6562
I have a script that uses a large chunk of text to train a model. The way it's written now, I can either read from a file or from stdin:
parser.add_argument('-i', help='input_file', default=sys.stdin)
... # do a bunch of other stuff
if args.i is sys.stdin:
    m.train(args.i)
else:
    m.train(open(args.i, 'r'))
Then I can call my script as:
python myscript.py -i trainingdata.txt
or
cat trainingdata.txt | python myscript.py
The second version is especially useful if I want to search the filesystem and use multiple files to train the model. However, this becomes tricky, due to the pipe, if I simultaneously try to profile with cProfile, i.e.
python -m cProfile myscript.py ...
I know that I can send it multiple files using the -i option and iterate over the files, but then I will have to change the behaviour of the train() method to avoid overwriting data.
Is there a good way to open an IO channel, for lack of a better expression, that concatenates the input without explicitly reading and writing line by line?
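For reference, the standard library's fileinput module comes close to such a channel: it presents several files, or stdin when no names are given, as a single stream of lines. A minimal sketch, with hypothetical filenames and a stand-in for m.train():

import fileinput

def train(lines):
    # stand-in for the question's m.train(); just consumes the lines
    for line in lines:
        print(line, end='')

# fileinput.input() concatenates the named files into one line iterator;
# called with no arguments it reads the files named in sys.argv[1:], or stdin
with fileinput.input(files=['file1.txt', 'file2.txt']) as lines:
    train(lines)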
Upvotes: 3
Views: 165
Reputation: 46901
You can chain open files, using a generator to yield them from the filenames:
from itertools import chain

def yield_open(filenames):
    # opens the files lazily, one at a time
    for filename in filenames:
        with open(filename, 'r') as file:
            yield file

def train(file):
    for line in file:
        print(line, end='')
    print()

# chain.from_iterable() stitches the files into one continuous iterator of lines
files = chain.from_iterable(yield_open(filenames=['file1.txt', 'file2.txt']))
train(files)
This has the added benefit that only one of your files is open at a time.
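A quick way to check that, reusing yield_open() from above (assuming file1.txt and file2.txt exist):

gen = yield_open(filenames=['file1.txt', 'file2.txt'])
first = next(gen)   # file1.txt is opened and handed out
second = next(gen)  # resuming the generator leaves the first with-block,
                    # which closes file1.txt before file2.txt is opened
assert first.closed and not second.closed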
You could also use it as a 'data pipeline', which may be more readable:
file_gen = yield_open(filenames=['file1.txt', 'file2.txt'])
files = chain.from_iterable(file_gen)
train(files)
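And a sketch of how this could plug back into the question's script so that piping to stdin keeps working, reusing yield_open() and train() from above (the nargs='*' handling is an assumption, not part of the original script):

import argparse
import sys
from itertools import chain

parser = argparse.ArgumentParser()
# zero or more input files; fall back to stdin when none are given
parser.add_argument('-i', nargs='*', default=None)
args = parser.parse_args()

lines = chain.from_iterable(yield_open(args.i)) if args.i else sys.stdin
train(lines)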
Upvotes: 2