Reputation: 84550
I recently had to debug some code that went something like this:
for key, group in itertools.groupby(csvGrid, lambda x: x[0]):
    value1 = sum(row[1] for row in group)
    value2 = sum(row[2] for row in group)
    results.append([key, value1, value2])
In every result set, value2 came out as 0. When I looked into it, I found that the first time the code iterated over group, it consumed it, so that the second time there were zero elements to iterate over.
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once. Is there any good reason why this is the case?
Upvotes: 3
Views: 2427
Reputation: 2581
I ran into the same issue when trying to iterate over a group returned by groupby more than once. The Python 3 documentation suggests storing the group as a list if its data is needed later.
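A minimal sketch of that fix applied to the code from the question. The sample rows in csv_grid are made up for illustration; the only real change is materializing each group once with list() before summing over it twice.

```python
import itertools

# Hypothetical sample data: [key, number, number], already ordered by key.
csv_grid = [["a", 1, 10], ["a", 2, 20], ["b", 3, 30]]

results = []
for key, group in itertools.groupby(csv_grid, lambda row: row[0]):
    rows = list(group)  # consume the one-shot iterator exactly once
    value1 = sum(row[1] for row in rows)
    value2 = sum(row[2] for row in rows)  # rows is a list, so this still sees every row
    results.append([key, value1, value2])
```

With this data, results ends up as [["a", 3, 30], ["b", 3, 30]] instead of having 0 in the last column.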
Upvotes: 0
Reputation: 281046
itertools is an iterator library, and like just about everything else in the library, the itertools.groupby groups are iterators. There isn't a single function in all of itertools that returns a sequence.
The reasons the groupby groups are iterators are the same reasons everything else in itertools is an iterator:
Additionally, the groups are iterators because you might only want the keys, in which case materializing the groups would be a waste.
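As a rough illustration of the keys-only case (the input string is just an example), the groups below are never iterated or stored by the caller:

```python
from itertools import groupby

data = "aaabbbcca"  # example input

# Only the keys are kept; each group iterator is simply ignored.
keys = [key for key, _ in groupby(data)]
```

Here keys is ['a', 'b', 'c', 'a'], and no per-group list was ever built.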
itertools.groupby is not intended to be an exact match for any LINQ construct, SQL clause, or other thing that goes by the name "group by". Its grouping behavior is closer to an extension of Unix's uniq command than what LINQ or SQL do, although the fact that it makes groups means it's not an exact match for uniq either.
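A small sketch of what that uniq-like behavior means in practice. The word list and the first_letter helper are made up for illustration: groupby only merges adjacent items with equal keys, so an SQL-style grouping needs the input sorted by the same key first.

```python
from itertools import groupby

words = ["apple", "avocado", "banana", "apricot"]  # example data


def first_letter(word):
    """Hypothetical key function: group words by their first letter."""
    return word[0]


# Unsorted input: 'a' shows up twice because the runs are not adjacent.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, key=first_letter)]

# Sorting by the same key first gives one group per distinct key.
sorted_words = sorted(words, key=first_letter)
sorted_groups = [(k, list(g)) for k, g in groupby(sorted_words, key=first_letter)]
```

unsorted_groups has three entries (a, b, a), while sorted_groups has two (a, b).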
As an example of something you could do with itertools.groupby that you couldn't with the other tools I've named, here's a run-length encoder:
from itertools import groupby

def runlengthencode(iterable):
    for key, group in groupby(iterable):
        # Count the items in each run without building a list.
        yield (key, sum(1 for val in group))
Upvotes: 7
Reputation: 77347
From the docs
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
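To make the quoted behavior concrete, here is a small sketch (the input string and variable names are just for illustration) that advances the groupby object by hand and then tries to read the earlier group:

```python
from itertools import groupby

grouped = groupby("AAABBC")

key_a, group_a = next(grouped)  # first group, key 'A'
key_b, group_b = next(grouped)  # advancing groupby skips the rest of the 'A' run

stale = list(group_a)  # the previous group is no longer visible
fresh = list(group_b)  # the current group is still readable
```

Here stale is an empty list and fresh is ['B', 'B']; calling list(group_a) before the second next() would have kept the 'A' items.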
Interestingly, if you don't consume g yourself, groupby will consume it for you before yielding the next group.
>>> import itertools
>>> def vals():
...     for i in range(10):
...         print(i)
...         yield i
...
>>> for k,g in itertools.groupby(vals(), lambda x: x<5):
...     print('processing group')
...
0
processing group
1
2
3
4
5
processing group
6
7
8
9
Upvotes: 1
Reputation: 94339
Intuitively, I would expect group to be a list which can be iterated over an indefinite number of times, but instead it behaves like an iterator which can only be iterated once.
That's correct.
Is there any good reason why this is the case?
It's potentially more memory efficient: you don't need to build an entire list first and then store it in memory, only to then iterate over it. Instead, you can process the elements as you iterate.
It's potentially more CPU efficient: by not generating all data up front, e.g. by producing a list, you can bail out early: if you find a particular group which matches some predicate, you can stop iteration - no further work needs to be done.
The decision of whether you need all the data and want to iterate over it multiple times is not hardcoded by the callee but is left to the caller.
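A rough sketch of the early-exit point, under made-up assumptions: numbers() is a hypothetical source that records what it has produced, and the predicate n >= 3 stands in for whatever condition you are looking for.

```python
from itertools import groupby

consumed = []  # records every item the source actually produced


def numbers():
    """Hypothetical large source; nothing is generated until it is requested."""
    for i in range(1_000_000):
        consumed.append(i)
        yield i


first_big = None
for is_big, group in groupby(numbers(), key=lambda n: n >= 3):
    if is_big:
        first_big = next(group)  # take only what is needed from this group
        break                    # stop here; the rest is never generated
```

After the loop, first_big is 3 and consumed holds only [0, 1, 2, 3], so the remaining items were never produced. If groups were lists, the second group alone would have required reading the whole source.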
Upvotes: 2