Reputation: 3449
I have two functions that both take iterators as input. Is there a way to write a generator that I can supply to both functions without resetting it or making a second pass? I want to do one pass over the data but feed the output to two functions. Example:
def my_generator(data):
    for row in data:
        yield row

gen = my_generator(data)
func1(gen)
func2(gen)
I know I could create two separate generator instances, or reset the generator between the two calls, but I was wondering whether there is a way to avoid making two passes over the data. Note that func1 and func2 themselves are NOT generators; if they were, that would be nice, because I could then build a pipeline.
The point here is to avoid a second pass over the data.
Upvotes: 7
Views: 1391
Reputation: 129
This is a little late, but it may be helpful for someone. For simplicity I have added only one ChildClass, but the idea is to have several of them:
class BaseClass:
    def on_yield(self, value: int):
        raise NotImplementedError()

    def summary(self):
        raise NotImplementedError()


class ChildClass(BaseClass):
    def __init__(self):
        self._aggregated_value = 0

    def on_yield(self, value: int):
        self._aggregated_value += value

    def summary(self):
        print(f"Aggregated value={self._aggregated_value}")


class Generator():
    def my_generator(self, data):
        for row in data:
            yield row

    def calculate(self, generator, classes):
        for index, value in enumerate(generator):
            print(f"index={index}")
            for _class in classes:
                _class.on_yield(value)
        for _class in classes:
            _class.summary()


if __name__ == '__main__':
    child_classes = [ChildClass(), ChildClass()]
    generator = Generator()
    my_generator = generator.my_generator([1, 2, 3])
    generator.calculate(my_generator, child_classes)
The output of this is:
index=0
index=1
index=2
Aggregated value=6
Aggregated value=6
Upvotes: 0
Reputation: 2581
You can either cache the generator's results in a list, or reset the generator before passing the data to func2. The problem is that if you have two loops, you need to iterate over the data twice, so you either load the data again and create a new generator, or you cache the entire result.
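As a minimal sketch of the caching option, assuming the data fits in memory: the generator is materialized once with list(), and both functions read the list. The bodies of func1 and func2 here are hypothetical stand-ins, since the question does not show them.

```python
def my_generator(data):
    # Stand-in for the question's generator; `data` is any iterable.
    for row in data:
        yield row


def func1(rows):
    # Hypothetical consumer: sums the rows.
    return sum(rows)


def func2(rows):
    # Hypothetical consumer: finds the largest row.
    return max(rows)


# Materialize the generator once. The source is read a single time,
# at the cost of holding every row in memory.
cached = list(my_generator([3, 1, 4, 1, 5]))

total = func1(cached)
largest = func2(cached)
print(total, largest)
```

The source is only read once, but the list is still iterated twice, so this trades memory for avoiding a second read of the data.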
Solutions like itertools.tee also just create two iterators. If func1 runs to completion before func2 starts, tee has to buffer every element, which is basically the same as caching the entire result. It is syntactic sugar, of course, but it does not change what happens in the background.
If you have big data, you have to merge func1 and func2 into a single loop:
for a in gen:
    f1(a)
    f2(a)
In practice it can be a good idea to design code like this, so that you have full control over the iteration process and can compose maps and filters over a single iterator.
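A rough sketch of what that composition might look like, assuming the work of both functions can be done one row at a time. The filter, the map and the two aggregates (a sum and a maximum) are made up for illustration and are not from the question.

```python
# Single-use source, standing in for the question's generator.
rows = (x for x in [3, 1, 4, 1, 5])

# Filters and maps are layered lazily on the same iterator;
# nothing is read from `rows` until the loop below runs.
odd_rows = (x for x in rows if x % 2 == 1)   # filter: keep odd values
scaled = (x * 10 for x in odd_rows)          # map: scale each value

total = 0        # the work a hypothetical func1 would do
largest = None   # the work a hypothetical func2 would do

# One loop, one pass: each value is handed to both computations.
for value in scaled:
    total += value
    if largest is None or value > largest:
        largest = value

print(total, largest)
```

Only one value is in flight at a time, so memory use does not grow with the size of the data.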
Upvotes: 3
Reputation: 5313
If using threads is an option, the generator may be consumed just once without having to store a possibly unpredictable number of yielded values between calls to the consumers. The following example runs the consumers in lock-step; Python 3.2 or later is needed for this implementation:
import threading


def generator():
    for x in range(10):
        print('generating {}'.format(x))
        yield x


def tee(iterable, n=2):
    barrier = threading.Barrier(n)
    state = dict(value=None, stop_iteration=False)

    def repeat():
        while True:
            if barrier.wait() == 0:
                try:
                    state.update(value=next(iterable))
                except StopIteration:
                    state.update(stop_iteration=True)
            barrier.wait()
            if state['stop_iteration']:
                break
            yield state['value']

    return tuple(repeat() for i in range(n))


def func1(iterable):
    for x in iterable:
        print('func1 consuming {}'.format(x))


def func2(iterable):
    for x in iterable:
        print('func2 consuming {}'.format(x))


gen1, gen2 = tee(generator(), 2)

thread1 = threading.Thread(target=func1, args=(gen1,))
thread1.start()

thread2 = threading.Thread(target=func2, args=(gen2,))
thread2.start()

thread1.join()
thread2.join()
Upvotes: 3
Reputation: 126085
Python has an amazing catalog of handy functions. You can find the ones related to iterators in itertools:
import itertools

def my_generator(data):
    for row in data:
        yield row

gen = my_generator(data)
gen1, gen2 = itertools.tee(gen)
func1(gen1)
func2(gen2)
However, this only makes sense if func1 and func2 don't consume all the elements, because if they do, itertools.tee() has to remember all the elements in gen until gen2 is used.
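The buffering can be seen in a small experiment. This is only an illustrative sketch: the `pulled` list is a hypothetical helper added to record what the source generator actually produces.

```python
import itertools

pulled = []  # records every value the source generator actually produces


def my_generator(data):
    for row in data:
        pulled.append(row)
        yield row


gen1, gen2 = itertools.tee(my_generator([1, 2, 3]))

# Exhausting gen1 drains the source; tee keeps the values for gen2.
first = list(gen1)
pulled_after_first = list(pulled)

# gen2 is served from tee's internal buffer; the source is not run again.
second = list(gen2)

print(first, second, pulled)
```

The source generator runs only once, but by the time gen1 is exhausted every element is sitting in tee's buffer waiting for gen2.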
To get around this, consume only a few elements at a time. Or change func1 to call func2. Or maybe even change func1 to be a lazy generator that yields its input, and pipe that into func2.
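A minimal sketch of the last suggestion, assuming func1 can do its work row by row. The `seen` parameter and the bodies of func1 and func2 are hypothetical, invented only to make the example runnable.

```python
def my_generator(data):
    for row in data:
        yield row


def func1(rows, seen):
    # Lazy variant of func1: handle each row, then pass it along unchanged.
    for row in rows:
        seen.append(row)  # placeholder for func1's real per-row work
        yield row


def func2(rows):
    # Hypothetical final consumer: drives the whole pipeline.
    return sum(rows)


func1_rows = []
result = func2(func1(my_generator([1, 2, 3]), func1_rows))
print(func1_rows, result)
```

Nothing runs until func2 starts iterating, and each row flows through func1 and into func2 before the next one is read, so the data is traversed exactly once.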
Upvotes: 3