Reputation: 810
I need to merge two iterators. I wrote this function:
def merge_no_repeat(iter1, iter2, key=None):
"""
a = iter([(2, 'a'), (4, 'a'), (6, 'a')])
b = iter([(1, 'b'), (2, 'b'), (3, 'b'), (4, 'b'), (5, 'b'), (6, 'b'), (7, 'b'), (8, 'b')])
key = lambda item: item[0]
fusion_no_repeat(a, b, key) ->
iter([(1, 'b'), (2, 'a'), (3, 'b'), (4, 'a'), (5, 'b'), (6, 'a'), (7, 'b'), (8, 'b')])
:param iter1: sorted iterator
:param iter2: sorted iterator
:param key: lambda get sorted key, default: lambda x: x
:return: merged iterator
"""
if key is None:
key = lambda x: x
element1 = next(iter1, None)
element2 = next(iter2, None)
while element1 is not None or element2 is not None:
if element1 is None:
yield element2
element2 = next(iter2, None)
elif element2 is None:
yield element1
element1 = next(iter1, None)
elif key(element1) > key(element2):
yield element2
element2 = next(iter2, None)
elif key(element1) == key(element2):
yield element1
element1 = next(iter1, None)
element2 = next(iter2, None)
elif key(element1) < key(element2):
yield element1
element1 = next(iter1, None)
This function works. But I think it's too complicated. Is it possible to make this function easiest using Python Standard Libraries?
Upvotes: 5
Views: 901
Reputation: 21
As I'm just working on a problem that needs this, here is the solution I came up with. I worked out my problem in Haskell first. I believe looking at an idiomatic Haskell solution is instructive.
merge k [] ys = ys
merge k xs [] = xs
merge k (x:xs) (y:ys) =
if k y < k x then y:merge k (x:xs) ys
else x:merge k xs (if k y==k x then ys else (y:ys))
The arguments are just lists, but because Haskell is lazy, it works even with infinite lists. :
is the Haskell list constructor, so x:xs
is equivalent to [x]+xs
in Python.
A recursive solution like this is a bit of a non-starter in Python, but we can preserve its essence while transforming the recursion into iteration by trampolining, like this:
from functools import partial
Nil = object()
def merge(xs, ys, k=lambda x:x):
def nxt(it): return next(it, Nil)
def tail(it): return lambda: (next(it), tail(it))
def both(x, y):
return ((y, tail(ys)) if x is Nil else
(x, tail(xs)) if y is Nil else
((y, partial(both, x, nxt(ys))) if k(y) < k(x) else
(x, partial(both, nxt(xs), nxt(ys) if k(x) == k(y) else y))))
to_yield, extract = both(nxt(xs), nxt(ys))
while to_yield is not Nil:
yield to_yield
to_yield, extract = extract()
A few observations about this:
Nil
instead of None
so that the sequences can contain None
values without interfering with the logic.next
with a value to return if the iterator is empty to avoid raising a StopIteration
exception too early.extract
function return the next value to yield and the next extract
function to use. While both iterables are active, the extract
function is both
.yield from
to play out one iterator when the other is empty. Instead, we switch into tail
mode, which continues trampolining but relies on the single argument form of next
raising StopIteration
to terminate the process.Here's an example of it in action:
>>> from itertools import islice, imap
>>> mul = lambda x: lambda y: x*y
>>> list(islice(merge(imap(mul(5), count()), imap(mul(3), count())), 20))
[0, 3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24, 25, 27, 30, 33, 35, 36, 39, 40]
Upvotes: 2
Reputation: 31270
One, this fails if either of the iterators returns None, you should probably catch StopIteration exceptions. Two, once one of the iterators has no more values, you can just return all the rest of the values of the other one.
I think this is easier to make if you use a small wrapper class around an iterator that makes the next value visible:
class NextValueWrapper(object):
def __init__(self, iterator):
self.iterator = iterator
self.next_value = None
self.finished = False
self.get()
def get(self):
if self.finished: return # Shouldn't happen, maybe raise an exception
value = self.next_value
try:
self.next_value = next(self.iterator)
except StopIteration:
self.finished = True
return value
Then the code becomes:
def merge(iter1, iter2, key=None):
if key is None:
key = lambda x: x
wrap1 = NextValueWrapper(iter1)
wrap2 = NextValueWrapper(iter2)
while not (wrap1.finished and wrap2.finished):
if (wrap2.finished or
(not wrap1.finished and
key(wrap1.next_value) <= key(wrap2.next_value))):
yield wrap1.get()
else:
yield wrap2.get()
This is untested. And it repeats. And it's Python 2, out of habit. Making it non-repeating is left as an exercise for the reader, I hadn't noticed that was a requirement too...
Upvotes: 1
Reputation: 5407
The pytoolz library has an implementation of this. It doesn't look like it uses any non-standard-library functions so if you really don't want to include an external library you could probably just copy the code.
If you're interested in speed there's also a cython implementation of pytoolz.
Upvotes: 2
Reputation: 4186
You can use:
def merge_no_repeat(iter1, iter2, key=None):
if key is None:
key = lambda x: x
ref = next(iter1, None)
for elem in iter2:
key_elem = key(elem) # caching value so we won't compute it for each value in iter1 that is before this one
while ref is not None and key_elem > key(ref):
# Catch up with low values from iter1
yield ref
ref = next(iter1, None)
if ref is None or key_elem < key(ref):
# Catch up with low values from iter2, eliminate duplicates
yield elem
# Update: I forgot to consume iter1 in the first version of this code
for elem in iter1:
# Use remaining items of iter1 if needed
yield elem
I assumed that the iterators wouldn't return None
values except when completely consumed, since you have if element1 is None:
and elif element1 is None:
tests in your original code.
Examples:
>>> from operator import itemgetter
>>> list(merge_no_repeat(
... iter([(2, 'a'), (4, 'a'), (6, 'a')]),
... iter([(1, 'b')]),
... itemgetter(0)))
[(1, 'b'), (2, 'a'), (4, 'a'), (6, 'a')]
>>> list(merge_no_repeat(
... iter([(2, 'a'), (4, 'a'), (6, 'a')]),
... iter([(1, 'b'),(7, 'b'), (8, 'b')]),
... itemgetter(0)))
[(1, 'b'), (2, 'a'), (4, 'a'), (6, 'a'), (7, 'b'), (8, 'b')]
>>> list(merge_no_repeat(
... iter([(2, 'a'), (4, 'a'), (6, 'a')]),
... iter([(1, 'b'),(3, 'b'), (4,'b'),(5,'b'),(7, 'b'), (8, 'b')]),
... itemgetter(0)))
[(1, 'b'), (2, 'a'), (3, 'b'), (4, 'a'), (5, 'b'), (6, 'a'), (7, 'b'), (8, 'b')]
Upvotes: 0