Reputation: 84550

Why can an itertools.groupby grouping only be iterated once?

I recently had to debug some code that went something like this:

for key, group in itertools.groupby(csvGrid, lambda x: x[0]):
    value1 = sum(row[1] for row in group)
    value2 = sum(row[2] for row in group)
    results.append([key, value1, value2])

In every result set, value2 came out as 0. When I looked into it, I found that the first time the code iterated over group, it consumed it, so that the second time there were zero elements to iterate over.

Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once. Is there any good reason why this is the case?

Upvotes: 3

Answers (4)

Jia

Reputation: 2581

I ran into the same issue when trying to iterate over a group returned by groupby more than once. The Python 3 documentation suggests storing the group in a list so that it can be accessed later.
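
Applied to the loop from the question, that could look roughly like the sketch below. The sample rows in csv_grid are made up for illustration, and they are assumed to be sorted by the key column already.

```python
import itertools

# Hypothetical sample rows: (key, value1, value2), already sorted by key.
csv_grid = [
    ("a", 1, 10),
    ("a", 2, 20),
    ("b", 5, 50),
]

results = []
for key, group in itertools.groupby(csv_grid, key=lambda row: row[0]):
    rows = list(group)  # materialize once so the rows can be iterated repeatedly
    value1 = sum(row[1] for row in rows)
    value2 = sum(row[2] for row in rows)
    results.append([key, value1, value2])

print(results)  # [['a', 3, 30], ['b', 5, 50]]
```

The only change from the original loop is the list(group) call; both sums then read from the list rather than from the one-shot iterator.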

Upvotes: 0

user2357112

Reputation: 281046

itertools is an iterator library, and like just about everything else in the library, the itertools.groupby groups are iterators. There isn't a single function in all of itertools that returns a sequence.

The reasons the groupby groups are iterators are the same reasons everything else in itertools is an iterator:

It's more memory efficient.
The groups could be infinite.
You can get results immediately instead of waiting for the whole group to be ready.

Additionally, the groups are iterators because you might only want the keys, in which case materializing the groups would be a waste.
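
As a rough illustration of those points, here is a sketch that runs groupby over inputs that never end. The variable names are mine; count, repeat and islice are the standard itertools functions.

```python
from itertools import count, groupby, islice, repeat

# count() never ends, but groupby only reads as far as it is asked to.
decades = groupby(count(), key=lambda n: n // 10)

# Only the keys are wanted here, so no group is ever stored in memory.
first_keys = [key for key, _group in islice(decades, 3)]
print(first_keys)  # [0, 1, 2]

# A single group can itself be infinite: every item of repeat(1) is equal.
key, ones = next(groupby(repeat(1)))
first_ones = list(islice(ones, 3))
print(first_ones)  # [1, 1, 1]
```

If groupby built a list for each group, neither call would ever return.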

itertools.groupby is not intended to be an exact match for any LINQ construct, SQL clause, or other thing that goes by the name "group by". Its grouping behavior is closer to an extension of Unix's uniq command than what LINQ or SQL do, although the fact that it makes groups means it's not an exact match for uniq either.
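
A small sketch of the uniq-like behavior, using made-up data: only consecutive equal items end up in the same group, so unsorted input can produce the same key more than once.

```python
from itertools import groupby

data = ["a", "a", "b", "a"]

# Unsorted input: the trailing "a" starts a new group, as with uniq.
keys_as_is = [key for key, _group in groupby(data)]
print(keys_as_is)  # ['a', 'b', 'a']

# Sorting first gives one group per distinct key, closer to SQL's GROUP BY.
keys_sorted = [key for key, _group in groupby(sorted(data))]
print(keys_sorted)  # ['a', 'b']
```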

As an example of something you could do with itertools.groupby that you couldn't with the other tools I've named, here's a run-length encoder:

from itertools import groupby

def runlengthencode(iterable):
    # Each group is one run of equal consecutive items; count its length.
    for key, group in groupby(iterable):
        yield (key, sum(1 for _ in group))
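
For completeness, a usage sketch of the encoder on a short string. The function is repeated so the snippet runs on its own, and the decoding line is just a hypothetical inverse to check the round trip.

```python
from itertools import groupby

def runlengthencode(iterable):
    for key, group in groupby(iterable):
        yield (key, sum(1 for _ in group))

encoded = list(runlengthencode("aaabccdd"))
print(encoded)  # [('a', 3), ('b', 1), ('c', 2), ('d', 2)]

# Hypothetical inverse: repeat each character by its run length.
decoded = "".join(key * n for key, n in encoded)
print(decoded)  # aaabccdd
```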

Upvotes: 7

tdelaney

Reputation: 77347

From the docs:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
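
A minimal sketch of what "no longer visible" means in practice, stepping the groupby object by hand with next(). The variable names are only for illustration.

```python
from itertools import groupby

grouped = groupby("aabbb")

key_a, group_a = next(grouped)
first = next(group_a)           # read only one item of the "a" group

key_b, group_b = next(grouped)  # advancing groupby skips the rest of "a"

leftover = list(group_a)        # the earlier group now yields nothing
saved = list(group_b)           # the current group, stored as a list

print(first, leftover, saved)   # a [] ['b', 'b', 'b']
```

Once a group has been copied into a list, as with saved here, it stays usable no matter how far groupby is advanced afterwards.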

Interestingly, if you don't consume g yourself, groupby will consume the rest of it for you before it produces the next group.

>>> import itertools
>>> def vals():
...     for i in range(10):
...         print(i)
...         yield i
... 
>>> for k,g in itertools.groupby(vals(), lambda x: x<5):
...     print('processing group')
... 
0
processing group
1
2
3
4
5
processing group
6
7
8
9

Upvotes: 1

Frerich Raabe

Reputation: 94339

"Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once."

That's correct.

"Is there any good reason why this is the case?"

It's potentially more memory efficient: you don't need to build an entire list first and then store it in memory, only to then iterate over it. Instead, you can process the elements as you iterate.
It's potentially more CPU efficient: by not generating all data up front, e.g. by producing a list, you can bail out early: if you find a particular group which matches some predicate, you can stop iteration - no further work needs to be done.

The decision of whether you need all data and iterate it multiple times is not hardcoded by the callee but is left to the caller.
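
To make the early bail-out concrete, here is a sketch with a made-up source generator that records every item it hands out. The caller stops at the first run of three or more equal items, and the rest of the input is never read.

```python
from itertools import groupby

consumed = []  # records what was actually pulled from the source

def source():
    # Hypothetical input; imagine each item being expensive to produce.
    for n in [1, 1, 2, 2, 2, 3, 3, 4]:
        consumed.append(n)
        yield n

first_long_run = None
for key, group in groupby(source()):
    run = list(group)  # the caller chooses to materialize this one group
    if len(run) >= 3:
        first_long_run = (key, run)
        break  # nothing further is read from source()

print(first_long_run)  # (2, [2, 2, 2])
print(consumed)        # [1, 1, 2, 2, 2, 3]
```

The single 3 in consumed is the item groupby had to read to notice that the run of 2s had ended; the second 3 and the 4 are never produced.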

Upvotes: 2
