Fred Foo
Fred Foo

Reputation: 363597

How can I count the number of items in an arbitrary iterable (such as a generator)?

Suppose I have an arbitrary iterable - for example, a generator that iterates over lines of a file and yields the ones matching a regex.

How can I count the number of items in that iterable, supposing that I don't care about the elements themselves?

Upvotes: 118

Views: 67408

Answers (10)

Stefan Pochmann
Stefan Pochmann

Reputation: 28606

Solution 1: Using batched from itertools seems even faster than the list solution, and uses only limited memory:

sum(map(len, batched(iterable, 1000)))

Solution 2: An itertools contraption recently adopted by more-itertools:

sum(compress(repeat(1), zip(iterable)))

The zip wraps input elements in a tuple to get true values, so that compress yields a 1 for every one of them.

Benchmark with iterable = repeat(None, 100000):

 0.32 ± 0.00 ms  ilen_batch_lengths
 0.38 ± 0.00 ms  ilen_list
 1.53 ± 0.01 ms  ilen_qr
 1.91 ± 0.01 ms  ilen_qr2
 2.06 ± 0.02 ms  ilen_sum_1_itertools
 2.24 ± 0.01 ms  ilen_indexOf_tuple
 3.07 ± 0.02 ms  ilen_countOf_True
 3.08 ± 0.04 ms  ilen_zip_count
 3.09 ± 0.02 ms  ilen_sum_bool_zip
 3.29 ± 0.03 ms  ilen_indexOf_None
 3.65 ± 0.08 ms  ilen_sum_1

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

Code:

def ilen_sum_1_itertools(iterable):
    return sum(compress(repeat(1), zip(iterable)))

def ilen_batch_lengths(iterable):
    return sum(map(len, batched(iterable, 1000)))

def ilen_zip_count(iterable):
    counter = count()
    deque(zip(iterable, counter), 0)
    return next(counter)

def ilen_list(iterable):
    return len(list(iterable))

def ilen_sum_1(iterable):
    return sum(1 for _ in iterable)

def ilen_sum_bool_zip(iterable):
    return sum(map(bool, zip(iterable)))

def ilen_countOf_True(iterable):
    return countOf(map(bool, zip(iterable)), True)

def ilen_indexOf_tuple(iterable):
    return indexOf(chain(zip(iterable), [()]), ())

def ilen_indexOf_None(iterable):
    return indexOf(chain(zip(iterable), [None]), None)

def ilen_qr(iterable):
    q = 0
    r = tuple(range(1000))
    rit = iter([0])
    def chunks():
        nonlocal q, rit
        while True:
            rit = iter(r)
            yield rit
            q += 1
    deque(zip(iterable, chain.from_iterable(chunks())), 0)
    return q * 1000 + next(rit, 1000)

def ilen_qr2(iterable):
    q = chain.from_iterable(map(repeat, count(), repeat(1000)))
    r = cycle(range(1000))
    deque(zip(iterable, q, r), 0)
    return next(q) * 1000 + next(r)

funcs = [
    ilen_zip_count,
    ilen_list,
    ilen_sum_1,
    ilen_sum_1_itertools,
    ilen_batch_lengths,
    ilen_sum_bool_zip,
    ilen_countOf_True,
    ilen_indexOf_tuple,
    ilen_indexOf_None,
    ilen_qr,
    ilen_qr2,
]

from itertools import *
from operator import countOf, indexOf
from timeit import timeit
from statistics import mean, stdev
from collections import deque
import sys
import random

for n in range(2345):
    for f in funcs:
        assert f(iter(repeat(None, n))) == n

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e3 for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.2f} ± {stdev(ts):4.2f} ms '
for _ in range(100):
    random.shuffle(funcs)
    for f in funcs:
        t = timeit(lambda: f(repeat(None, 100000)), number=1)
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)

print('\nPython:', sys.version)

Attempt This Online!

Upvotes: 1

Benjamin Drung
Benjamin Drung

Reputation: 189

The reduce() function from the functional programming functions map(), filter(), reduce(), zip() is the most appropriate one:

import functools
functools.reduce(lambda i, _: i + 1, it, 0)

This approach avoids building additional iterators or lists like the top-rated sum(1 for dummy in it).

Upvotes: 0

ShadowRanger
ShadowRanger

Reputation: 155418

Method that's meaningfully faster than sum(1 for _ in it) when the iterable may be long (and not meaningfully slower when the iterable is short), while maintaining fixed memory overhead behavior (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs:

# On Python 2 only, get zip that lazily generates results instead of returning list
from future_builtins import zip

from collections import deque
from itertools import count

# Avoid constructing a deque each time, reduces fixed overhead enough
# that this beats the sum solution for all but length 0-1 inputs
consumeall = deque(maxlen=0).extend

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    consumeall(zip(it, cnt)) # cnt must be second zip arg to avoid advancing too far
    # Since count 0 based, the next value is the count
    return next(cnt)

Like len(list(it)) it performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.

It's surprisingly difficult to come up with fair test cases for comparing performance (list cheats using __length_hint__ which isn't likely to be available for arbitrary input iterables, itertools functions that don't provide __length_hint__ often have special operating modes that work faster when the value returned on each loop is released/freed before the next value is requested, which deque with maxlen=0 will do). The test case I used was to create a generator function that would take an input and return a C level generator that lacked special itertools return container optimizations or __length_hint__, using Python 3.3+'s yield from:

def no_opt_iter(it):
    yield from it

Then using ipython %timeit magic (substituting different constants for 100):

>>> %%timeit fakeinput = (0,) * 100
... ilen(no_opt_iter(fakeinput))

When the input isn't large enough that len(list(it)) would cause memory issues, on a Linux box running Python 3.9 x64, my solution takes about 50% longer than def ilen(it): return len(list(it)), regardless of input length.

For the smallest of inputs, the setup costs to load/call consumeall/zip/count/next means it takes infinitesimally longer this way than def ilen(it): sum(1 for _ in it) (about 40 ns more on my machine for a length 0 input, a 10% increase over the simple sum approach), but by the time you hit length 2 inputs, the cost is equivalent, and somewhere around length 30, the initial overhead is unnoticeable compared to the real work; the sum approach takes roughly 50% longer.

Basically, if memory use matters or inputs don't have bounded size and you care about speed more than brevity, use this solution. If inputs are bounded and smallish, len(list(it)) is probably best, and if they're unbounded, but simplicity/brevity counts, you'd use sum(1 for _ in it).

Upvotes: 51

Alexander
Alexander

Reputation: 374

In case you want to use the iterable elsewhere and know how many elements were consumed, you can create a simple wrapper class:

from collections.abc import Iterable, Iterator
from typing import Generic, TypeVar

_T = TypeVar("_T")


class IterCounter(Generic[_T]):
    """Iterator that keeps count of the consumed elements"""

    def __init__(self, iterable: Iterable[_T]) -> None:
        self._iterator = iter(iterable)
        self.count = 0

    def __iter__(self) -> Iterator[_T]:
        return self

    def __next__(self) -> _T:
        element = next(self._iterator)
        self.count += 1
        return element


counter = IterCounter(range(5))

print(counter.count)  # 0

print(list(counter))  # [0, 1, 2, 3, 4]

print(counter.count)  # 5

Upvotes: 1

Nikhil CSB
Nikhil CSB

Reputation: 175

len(list(it))

Although, it can hang up if it's an infinite generator.

Upvotes: 6

Sven Marnach
Sven Marnach

Reputation: 601679

Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by equivalent generator expressions:

sum(1 for dummy in it)

This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.

Upvotes: 214

prosti
prosti

Reputation: 46341

These would be my choices either one or another:

print(len([*gen]))
print(len(list(gen)))

Upvotes: 3

pylang
pylang

Reputation: 44505

more_itertools is a third-party library that implements an ilen tool. pip install more_itertools

import more_itertools as mit


mit.ilen(x for x in range(10))
# 10

Upvotes: 7

Erwin Mayer
Erwin Mayer

Reputation: 18670

I like the cardinality package for this, it is very lightweight and tries to use the fastest possible implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

Upvotes: 2

Greg Hewgill
Greg Hewgill

Reputation: 993163

A short way is:

def ilen(it):
    return len(list(it))

Note that if you are generating a lot of elements (say, tens of thousands or more), then putting them in a list may become a performance issue. However, this is a simple expression of the idea where the performance isn't going to matter for most cases.

Upvotes: 9

Related Questions