Reputation: 810

Merge two sorted iterators without replace

I need to merge two iterators. I wrote this function:

def merge_no_repeat(iter1, iter2, key=None):
    """
    a = iter([(2, 'a'), (4, 'a'), (6, 'a')])
    b = iter([(1, 'b'), (2, 'b'), (3, 'b'), (4, 'b'), (5, 'b'), (6, 'b'), (7, 'b'), (8, 'b')])
    key = lambda item: item[0]
    fusion_no_repeat(a, b, key) ->
                iter([(1, 'b'), (2, 'a'), (3, 'b'), (4, 'a'), (5, 'b'), (6, 'a'), (7, 'b'), (8, 'b')])
    :param iter1: sorted iterator
    :param iter2: sorted iterator
    :param key: lambda get sorted key, default: lambda x: x
    :return: merged iterator
    """
    if key is None:
        key = lambda x: x
    element1 = next(iter1, None)
    element2 = next(iter2, None)
    while element1 is not None or element2 is not None:
        if element1 is None:
            yield element2
            element2 = next(iter2, None)
        elif element2 is None:
            yield element1
            element1 = next(iter1, None)
        elif key(element1) > key(element2):
            yield element2
            element2 = next(iter2, None)
        elif key(element1) == key(element2):
            yield element1
            element1 = next(iter1, None)
            element2 = next(iter2, None)
        elif key(element1) < key(element2):
            yield element1
            element1 = next(iter1, None)

This function works. But I think it's too complicated. Is it possible to make this function easiest using Python Standard Libraries?

Upvotes: 5

Answers (4)

samer

Reputation: 21

As I'm just working on a problem that needs this, here is the solution I came up with. I worked out my problem in Haskell first. I believe looking at an idiomatic Haskell solution is instructive.

merge k [] ys = ys
merge k xs [] = xs
merge k (x:xs) (y:ys) =
  if k y < k x then y:merge k (x:xs) ys
  else x:merge k xs (if k y==k x then ys else (y:ys))

The arguments are just lists, but because Haskell is lazy, it works even with infinite lists. : is the Haskell list constructor, so x:xs is equivalent to [x]+xs in Python.

A recursive solution like this is a bit of a non-starter in Python, but we can preserve its essence while transforming the recursion into iteration by trampolining, like this:

from functools import partial

Nil = object()
def merge(xs, ys, k=lambda x:x):
    def nxt(it): return next(it, Nil)
    def tail(it): return lambda: (next(it), tail(it))
    def both(x, y):
        return ((y, tail(ys)) if x is Nil else
                (x, tail(xs)) if y is Nil else
                ((y, partial(both, x, nxt(ys))) if k(y) < k(x) else
                 (x, partial(both, nxt(xs), nxt(ys) if k(x) == k(y) else y))))
    to_yield, extract = both(nxt(xs), nxt(ys))
    while to_yield is not Nil:
        yield to_yield
        to_yield, extract = extract()

A few observations about this:

We create a new unique out-of-band value Nil instead of None so that the sequences can contain None values without interfering with the logic.
We use the two argument form of next with a value to return if the iterator is empty to avoid raising a StopIteration exception too early.
We trampoline by having the extract function return the next value to yield and the next extract function to use. While both iterables are active, the extract function is both.
This is Python 2 because that's what I use, but it does mean there's no yield from to play out one iterator when the other is empty. Instead, we switch into tail mode, which continues trampolining but relies on the single argument form of next raising StopIteration to terminate the process.

Here's an example of it in action:

>>> from itertools import islice, imap
>>> mul = lambda x: lambda y: x*y
>>> list(islice(merge(imap(mul(5), count()), imap(mul(3), count())), 20))
[0, 3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24, 25, 27, 30, 33, 35, 36, 39, 40]

Upvotes: 2

RemcoGerlich

Reputation: 31270

One, this fails if either of the iterators returns None, you should probably catch StopIteration exceptions. Two, once one of the iterators has no more values, you can just return all the rest of the values of the other one.

I think this is easier to make if you use a small wrapper class around an iterator that makes the next value visible:

class NextValueWrapper(object):
    def __init__(self, iterator):
        self.iterator = iterator
        self.next_value = None
        self.finished = False
        self.get()

    def get(self):
        if self.finished: return  # Shouldn't happen, maybe raise an exception
        value = self.next_value
        try:
            self.next_value = next(self.iterator)
        except StopIteration:
            self.finished = True
        return value

Then the code becomes:

def merge(iter1, iter2, key=None):
    if key is None:
        key = lambda x: x

    wrap1 = NextValueWrapper(iter1)
    wrap2 = NextValueWrapper(iter2)

    while not (wrap1.finished and wrap2.finished):
        if (wrap2.finished or 
               (not wrap1.finished and 
                key(wrap1.next_value) <= key(wrap2.next_value))):
            yield wrap1.get()
        else:
            yield wrap2.get()

This is untested. And it repeats. And it's Python 2, out of habit. Making it non-repeating is left as an exercise for the reader, I hadn't noticed that was a requirement too...

Upvotes: 1

dshepherd

Reputation: 5407

The pytoolz library has an implementation of this. It doesn't look like it uses any non-standard-library functions so if you really don't want to include an external library you could probably just copy the code.

If you're interested in speed there's also a cython implementation of pytoolz.

Upvotes: 2

301_Moved_Permanently

Reputation: 4186

You can use:

def merge_no_repeat(iter1, iter2, key=None):
    if key is None:
        key = lambda x: x
    ref = next(iter1, None)
    for elem in iter2:
        key_elem = key(elem) # caching value so we won't compute it for each value in iter1 that is before this one
        while ref is not None and key_elem > key(ref):
            # Catch up with low values from iter1
            yield ref
            ref = next(iter1, None)
        if ref is None or key_elem < key(ref):
            # Catch up with low values from iter2, eliminate duplicates
            yield elem
    # Update: I forgot to consume iter1 in the first version of this code
    for elem in iter1:
        # Use remaining items of iter1 if needed
        yield elem

I assumed that the iterators wouldn't return None values except when completely consumed, since you have if element1 is None: and elif element1 is None: tests in your original code.

Examples:

>>> from operator import itemgetter
>>> list(merge_no_repeat(
...     iter([(2, 'a'), (4, 'a'), (6, 'a')]),
...     iter([(1, 'b')]),
...     itemgetter(0)))
[(1, 'b'), (2, 'a'), (4, 'a'), (6, 'a')]
>>> list(merge_no_repeat(
...     iter([(2, 'a'), (4, 'a'), (6, 'a')]),
...     iter([(1, 'b'),(7, 'b'), (8, 'b')]),
...     itemgetter(0)))
[(1, 'b'), (2, 'a'), (4, 'a'), (6, 'a'), (7, 'b'), (8, 'b')]
>>> list(merge_no_repeat(
...     iter([(2, 'a'), (4, 'a'), (6, 'a')]),
...     iter([(1, 'b'),(3, 'b'), (4,'b'),(5,'b'),(7, 'b'), (8, 'b')]),
...     itemgetter(0)))
[(1, 'b'), (2, 'a'), (3, 'b'), (4, 'a'), (5, 'b'), (6, 'a'), (7, 'b'), (8, 'b')]

Upvotes: 0

Merge two sorted iterators without replace

Answers (4)

Related Questions