XZS
XZS

Reputation: 2504

zip_longest without fillvalue

I am searching for a middle ground between Python's zip and zip_longest functions (from the itertools module), that exhausts all given iterators, but does not fill in anything. So, for example, it should transpose tuples like so:

(11, 12, 13    ),        (11, 21, 31, 41),
(21, 22, 23, 24),  -->   (12, 22, 32, 42),
(31, 32        ),        (13, 23,     43),
(41, 42, 43, 44),        (    24,     44)

(Spaces added for nicer graphical alignment.)

I managed to compose a crude a solution by cleaning out the fillvalues after zip_longest.

def zip_discard(*iterables, sentinel = object()):
    return map(
            partial(filter, partial(is_not, sentinel)), 
            zip_longest(*iterables, fillvalue=sentinel))

Is there a way to do this without introducing the sentinels to begin with? Can this be improved using yield? Which approach seems most efficient?

Upvotes: 19

Views: 5623

Answers (3)

Guy Gangemi
Guy Gangemi

Reputation: 1773

Trying to solve this without a sentinel (not even in the interim) yielded some interesting insights.

m = (
  (11, 12, 13,       ),
  (21, 22, 23, 24, 25),
  (31, 32,           ),
  (41, 42, 43, 44,   ),
)

result = [[r[i] for r in m if r[i:i+1]] for i in range(0, max(len(r) for r in m))]

First insight, I've been using python for long enough that I'll constantly question the use of indices. There is always a better way. ALWAYS.

Second insight, a slice returns an empty tuple where ref by index would cause an IndexError. Replacing r[i:i+1] with len(r) < i would work in this example.

So what do I think is the better way? Using a fillvalue and then filter to strip out the value if you must.

t = tuple(zip_longest(*m))
t_clean = [list(filter(None, r)) for r in t]
m_again = tuple(zip_longest(*t))

The zip transposing idiom is not easy to grasp but worth learning. Transposing again yields the original which can't be done without a missing placeholder. Can also "skip" values. Considering we're all putting in spaces to make the matrix legible, we all want the placeholders there. If m was created with the infill value it would equal m_again, which is a nice property. Prioritise developer performance, the real bottle neck.

Upvotes: 1

Mike M&#252;ller
Mike M&#252;ller

Reputation: 85492

You approach is good. I think using a sentinel is elegant. I might be considered a bit more pythonic to use a nested generator expression:

def zip_discard_gen(*iterables, sentinel=object()):
    return ((entry for entry in iterable if entry is not sentinel)
            for iterable in zip_longest(*iterables, fillvalue=sentinel))

This needs fewer imports because there is no need for partial() or ne().

It is a bit faster too:

data = [(11, 12, 13    ),
        (21, 22, 23, 24),
        (31, 32        ),
        (41, 42, 43, 44)]

%timeit [list(x) for x in zip_discard(*data)]  
10000 loops, best of 3: 17.5 µs per loop

%timeit [list(x) for x in zip_discard_gen(*data)]
100000 loops, best of 3: 14.2 µs per loop

EDIT

A list comprehension version is a bit faster:

def zip_discard_compr(*iterables, sentinel=object()):
    return [[entry for entry in iterable if entry is not sentinel]
            for iterable in zip_longest(*iterables, fillvalue=sentinel)]

Timing:

%timeit zip_discard_compr(*data)
100000 loops, best of 3: 6.73 µs per loop

A Python 2 version:

from itertools import izip_longest

SENTINEL = object()

def zip_discard_compr(*iterables):
    sentinel = SENTINEL
    return [[entry for entry in iterable if entry is not sentinel]
            for iterable in izip_longest(*iterables, fillvalue=sentinel)]

Timings

This version returns the same data structure as zip_varlen by Tadhg McDonald-Jensen:

def zip_discard_gen(*iterables, sentinel=object()):
    return (tuple([entry for entry in iterable if entry is not sentinel])
            for iterable in zip_longest(*iterables, fillvalue=sentinel))

It is about twice as fast:

%timeit list(zip_discard_gen(*data))
100000 loops, best of 3: 9.37 µs per loop

%timeit list(zip_varlen(*data))
10000 loops, best of 3: 18 µs per loop

Upvotes: 7

Tadhg McDonald-Jensen
Tadhg McDonald-Jensen

Reputation: 21453

Both zip and zip_longest were designed to always generate tuples of equal length, you can define your own generator that doesn't care about the len with something like this:

def _one_pass(iters):
    for it in iters:
        try:
            yield next(it)
        except StopIteration:
            pass #of some of them are already exhausted then ignore it.

def zip_varlen(*iterables):
    iters = [iter(it) for it in iterables]
    while True: #broken when an empty tuple is given by _one_pass
        val = tuple(_one_pass(iters))
        if val:
            yield val
        else:
            break

If the data being zipped together is fairly large then skipping the exhausted iterators every time can be expensive, it may be more efficient to remove the finished iterators from iters in the _one_pass function like this:

def _one_pass(iters):
    i = 0
    while i<len(iters):
        try:
            yield next(iters[i])
        except StopIteration:
            del iters[i]
        else:
            i+=1

both of these versions would remove the need to create intermediate results or use temporary filler values.

Upvotes: 9

Related Questions