Reputation: 6562
I have a script that uses a large chunk of text to train a model. The way it's written now, I can either read from a file or from stdin:
parser.add_argument('-i', help='input_file', default=sys.stdin)
... # do a bunch of other stuff
if args.i is sys.stdin:
    m.train(args.i)
else:
    m.train(open(args.i, 'r'))
Then I can call my script as:
python myscript.py -i trainingdata.txt
or
cat trainingdata.txt | python myscript.py
The second version is especially useful if I want to search the filesystem and use multiple files to train the model. However, this becomes tricky, due to the pipe, if I simultaneously try to profile with cProfile, i.e.
python -m cProfile myscript.py ...
I know that I can send it multiple files using the -i option and iterate over the files, but then I will have to change the behaviour of the train() method to avoid overwriting data.
Is there a good way to open an IO channel, for lack of a better expression, that concatenates the input without explicitly reading and writing line by line?
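For reference, the standard library's fileinput module comes close to such a channel: it presents several files, or stdin when no names are given, as a single stream of lines. A minimal sketch, with hypothetical filenames and a stand-in for m.train():

import fileinput

def train(lines):
    # stand-in for the question's m.train(); just consumes the lines
    for line in lines:
        print(line, end='')

# fileinput.input() concatenates the named files into one line iterator;
# called with no arguments it reads the files named in sys.argv[1:], or stdin
with fileinput.input(files=['file1.txt', 'file2.txt']) as lines:
    train(lines)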
Upvotes: 3
Views: 165
Reputation: 46901
You can chain open files, using a generator to yield them from the filenames:
from itertools import chain

def yield_open(filenames):
    # opens the files lazily, one at a time
    for filename in filenames:
        with open(filename, 'r') as file:
            yield file

def train(file):
    for line in file:
        print(line, end='')
    print()

# chain.from_iterable() stitches the files into one continuous iterator of lines
files = chain.from_iterable(yield_open(filenames=['file1.txt', 'file2.txt']))
train(files)
This has the added benefit that only one of your files is open at a time.
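A quick way to check that, reusing yield_open() from above (assuming file1.txt and file2.txt exist):

gen = yield_open(filenames=['file1.txt', 'file2.txt'])
first = next(gen)   # file1.txt is opened and handed out
second = next(gen)  # resuming the generator leaves the first with-block,
                    # which closes file1.txt before file2.txt is opened
assert first.closed and not second.closed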
You could also use it as a 'data pipeline', which may be more readable:
file_gen = yield_open(filenames=['file1.txt', 'file2.txt'])
files = chain.from_iterable(file_gen)
train(files)
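And a sketch of how this could plug back into the question's script so that piping to stdin keeps working, reusing yield_open() and train() from above (the nargs='*' handling is an assumption, not part of the original script):

import argparse
import sys
from itertools import chain

parser = argparse.ArgumentParser()
# zero or more input files; fall back to stdin when none are given
parser.add_argument('-i', nargs='*', default=None)
args = parser.parse_args()

lines = chain.from_iterable(yield_open(args.i)) if args.i else sys.stdin
train(lines)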
Upvotes: 2