Baz

Reputation: 13135

Establishing the number of values yielded by a generator

Suppose I have code along these lines:

counter = Counter()
text = f.read()
words = words_generator(text)
interesting_words = filter_generator(words)
counter.update(interesting_words)

for i in counter:
    print("Frequency for " + i + ": " + str(counter[i] / sum))

How should I best set the value of sum, which is the number of values yielded by words_generator?

Upvotes: 1

Views: 212

Answers (3)

Veedrac

Reputation: 60147

from collections import Counter

class CountItemsWrapper:
    def __init__(self, items):
        self.items = iter(items)
        self.count = 0

    def __next__(self):
        res = next(self.items)
        self.count += 1
        return res

    def __iter__(self):
        return self

counter = Counter()
text = f.read()
words = CountItemsWrapper(words_generator(text))
interesting_words = filter_generator(words)
counter.update(interesting_words)

for i in counter:
    print("Frequency for " + i + ": " + str(counter[i] / words.count))

Basically, CountItemsWrapper is an iterator that just passes through values, but keeps count whenever it does.

You can then just use the count attribute on the wrapper as your sum.


Explanation of the class:

def __init__(self, items):
    self.items = iter(items)
    self.count = 0

This is simple. Keep in mind that instances are iterators, not just iterables, so a wrapper can be iterated only once and it counts during that single pass.


def __next__(self):
    res = next(self.items)
    self.count += 1
    return res

This is called to get the next item. self.count must be incremented after the call to next because we allow the StopIteration to propagate and don't want to add to the count if we haven't yielded a value.


def __iter__(self):
    return self

This is an iterator so it returns itself.
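
To try the wrapper end to end, here is a minimal sketch. The question doesn't show words_generator or filter_generator, so the versions below are hypothetical stand-ins (split on whitespace, keep words longer than three characters), and a literal string replaces f.read().

```python
from collections import Counter


class CountItemsWrapper:
    """Iterator that passes values through and counts how many it has yielded."""

    def __init__(self, items):
        self.items = iter(items)
        self.count = 0

    def __next__(self):
        res = next(self.items)  # StopIteration propagates before the count changes
        self.count += 1
        return res

    def __iter__(self):
        return self


def words_generator(text):
    # Hypothetical stand-in: split on whitespace and lowercase.
    for word in text.split():
        yield word.lower()


def filter_generator(words):
    # Hypothetical stand-in: keep words longer than three characters.
    for word in words:
        if len(word) > 3:
            yield word


text = "the quick brown fox jumps over the lazy dog"
words = CountItemsWrapper(words_generator(text))
counter = Counter(filter_generator(words))  # consumes the whole pipeline

# words.count is only final once the pipeline has been fully consumed.
frequencies = {word: n / words.count for word, n in counter.items()}
for word, freq in frequencies.items():
    print("Frequency for " + word + ": " + str(freq))
```

Note that words.count should only be read after the Counter has consumed everything; before that it reflects only the words seen so far.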

Upvotes: 4

Bakuriu

Reputation: 101959

The simplest solution is to build a list:

words = list(words_generator(text))
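
With a list, the total is just len(words). A minimal sketch, assuming hypothetical stand-ins for words_generator and filter_generator (the question doesn't show them):

```python
from collections import Counter


def words_generator(text):
    # Hypothetical stand-in: split on whitespace.
    for word in text.split():
        yield word


def filter_generator(words):
    # Hypothetical stand-in: keep words longer than three characters.
    for word in words:
        if len(word) > 3:
            yield word


text = "a rose is a rose is a rose"
words = list(words_generator(text))  # materialize once
total = len(words)                   # number of values the generator yielded
counter = Counter(filter_generator(words))

frequencies = {word: n / total for word, n in counter.items()}
for word, freq in frequencies.items():
    print("Frequency for " + word + ": " + str(freq))
```

The cost is that every word is held in memory at once.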

Another option is to use itertools.tee:

import itertools

words, words_copy = itertools.tee(words_generator(text))

Afterwards you can use both copies of the iterable. Note, however, that if you iterate completely over one copy first, it is faster and more memory-efficient to simply build the list. To see any gain memory-wise you have to iterate over both copies "at the same time". For example, something like:

filtered = filter_generator(words)
total = 0
for word, _ in zip(filtered, words_copy): # use itertools.izip in python2
    counter[word] += 1
    total += 1
total += sum(1 for _ in words_copy)

This uses at most O(n-k) memory, where n is the number of words in the text and k is the number of interesting words in the text. You can simplify the code a bit using:

from itertools import zip_longest #izip_longest in python2
filtered = filter_generator(words)
total = 0
for word, _ in zip_longest(filtered, words_copy):
    counter[word] += 1
    total += 1
del counter[None]

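This is shorter, but the memory bound is unchanged: tee still buffers every word that filtered has read ahead of words_copy, so the worst case is still O(n-k), even if the generators themselves are constant-space.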

Note, however, that explicit loops will slow down the code, so in the end, if memory is not a concern, building a list for words may be the better solution.

Upvotes: 0

bruno desthuilliers

Reputation: 77912

Quick-and-dirty possible technical solution: wrap your generator in an iterable that keeps track of the number of items seen, i.e.:

class IterCount(object):
    def __init__(self, iterable):
        self._iterable = iterable
        self._count = 0

    def _itercount(self):
        for value in self._iterable:
            self._count += 1
            yield value

    def __iter__(self):
        return self._itercount()

    @property
    def count(self):
        return self._count


itc1 = IterCount(range(10))
print(list(itc1))
print(itc1.count)

itc2 = IterCount(xrange(10))  # xrange is Python 2 only; use range in Python 3
print(list(itc2))
print(itc2.count)
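
One thing to be aware of with this design, shown in a short Python 3 sketch below: the count accumulates across passes. Wrapping a re-iterable object such as a range and iterating it twice doubles the count, while a wrapped generator is simply exhausted after the first pass.

```python
class IterCount(object):
    def __init__(self, iterable):
        self._iterable = iterable
        self._count = 0

    def _itercount(self):
        for value in self._iterable:
            self._count += 1
            yield value

    def __iter__(self):
        return self._itercount()

    @property
    def count(self):
        return self._count


# A range can be iterated repeatedly, so each pass adds to the count.
itc = IterCount(range(10))
first_pass = list(itc)
count_after_first = itc.count
second_pass = list(itc)
count_after_second = itc.count
print(count_after_first, count_after_second)

# A generator is exhausted after one pass, so the count stops growing.
gen = IterCount(x for x in range(10))
gen_first = list(gen)
gen_second = list(gen)
print(len(gen_first), len(gen_second), gen.count)
```

For the word-frequency case the generator is consumed exactly once, so this does not matter there.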

Upvotes: 2
