iterator to mask multiple pages of an API

Question

I am building an API client in python for some api which provides the following layout of data, when it comes in multiple pages:

{
    "data": ["some", "pieces", "of", "data"],
    "results_per_page": 2500,
    "total_results": 10000,
    "next_url": "http://fullyqualifiedurl.com/results_after=5000",
    "previous_url": "http://fullyqualifiedurl.com/results_after=2500"
}

I want to have an iterator which a client can call like this:

>>> results = client.results()
>>> result_count = 0
>>> for result in results:
...     result_count += 1
...
>>> print(result_count)
10000

In which the iterator silently requests new page data as it reaches the end of its current page.

I have developed something which yields pages, but on subsequent calls, I want to not have to re-fetch the data. Here is what I have:

class Iterator:
    def __init__(self, current_page, max_results=None):
        self.current_page = current_page
        self.max_results = max_results
        self.yielded_count = 0

    def _iter_items(self):
        for page in self._iter_page():
            for item in page:
                # early break from page if we have set a limit.
                if self._limit_reached():
                    raise StopIteration
                self.yielded_count += 1
                yield item

    def _iter_page(self):
        while self.current_page is not None:
            yield self.current_page
            if self._has_next_page():
                self.current_page = self._get_next_page()
            else:
                self.current_page = None

    def __iter__(self):
        return self._iter_items()

    def __next__(self):
        return next(self._iter_items())

    def _iter_page(self):
        while self.current_page is not None:
            yield self.current_page
            if self._has_next_page():
                self.current_page = self._get_next_page()
            else:
                self.current_page = None

    def _get_next_page(self):
        if self.current_page.next_page_url:
            return self.api_request(self.current_page.next_page_url)
        else:
            return None

    def _keep_iterating(self):
        return (
            self.current_page is not None
            and self.max_results
            and self.yielded_count >= self.max_results
        )

    def _limit_reached(self):
        return self.max_results and self.yielded_count >= self.max_results

class Page:

    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]
        self._data_iterator = iter(datum for datum in json_data["data"])

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._data_iterator)
        return item

What's happening right now is that I can successfully iterate over it once, but on a second iteration the iterator is empty. I would like it to cache the results on the first pass and allow subsequent iterations. Am I going about this entirely the wrong way? I feel like there should be an established pattern for this, but I can't really find anything.

abarnert · Accepted Answer

I'm not sure whether you're talking about the Page type or the Iterator type here, because they're both iterators, and both have the same issues, and you've only given us a vague description of what you're doing with whichever one you're doing it with. But all of the following will apply just as well to either of them (except for one note), so I'll talk about Page, because it's the simpler one.

An iterator can only be used once. That's inherent in what it means to be an iterator.
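A quick way to see this, using a plain generator as a stand-in for Page (the names here are purely illustrative):

```python
# A generator is an iterator: once consumed, it stays empty.
numbers = (n for n in [1, 2, 3])

first_pass = list(numbers)   # consumes every value
second_pass = list(numbers)  # nothing left to yield

# An iterator returns itself from iter(); a list hands out a
# fresh, independent iterator on every call instead.
values = [1, 2, 3]
same_object = iter(numbers) is numbers
fresh_each_time = iter(values) is not iter(values)
```

Here first_pass holds all three values and second_pass is empty, which is the behavior described in the question.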

You can use itertools.tee to split off a second iterator, which caches the values from the first one.
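A small sketch of what that looks like, with a plain list iterator standing in for a page of results:

```python
from itertools import tee

source = iter(["some", "pieces", "of", "data"])

# tee returns independent iterators that share an internal buffer.
# After calling tee, the original `source` should not be used directly.
first, second = tee(source)

first_items = list(first)    # pulls from source, buffering values for `second`
second_items = list(second)  # replays the buffered values
```

When one branch is consumed completely before the other starts, as here, tee ends up buffering every value.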

But if your goal is to iterate over the same values over and over, there's a much simpler solution: just copy the iterator into a sequence, like a list or tuple. Then you can iterate over that as many times as you want.

page = list(Page(data, …))
for thing in page:
    print(thing)
for thing in page:
    print(thing)

While we're at it, your Iterator is not a valid iterator:

def __iter__(self):
    return self._iter_items()

def __next__(self):
    return next(self._iter_items())

An iterator must return self from __iter__, the way your Page does, and its __next__ must advance that same iteration rather than build a new generator on every call. Python doesn't enforce these rules, so if you get this wrong, you often end up with something that seems to work in one test, but then does the wrong thing somewhere else.
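One possible way to make it valid, as a minimal sketch: create the generator once, return self from __iter__, and advance that single generator in __next__. ItemIterator and its pages argument are hypothetical stand-ins for illustration, not the classes from the question.

```python
class ItemIterator:
    """Hypothetical stand-in for Iterator: a valid, single-use iterator."""

    def __init__(self, pages, max_results=None):
        self.max_results = max_results
        self.yielded_count = 0
        # Create the generator once, so every __next__ call advances the same one.
        self._items = self._iter_items(pages)

    def _iter_items(self, pages):
        for page in pages:
            for item in page:
                if self.max_results is not None and self.yielded_count >= self.max_results:
                    return  # ends the generator cleanly
                self.yielded_count += 1
                yield item

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._items)


items = ItemIterator([["a", "b"], ["c", "d"]], max_results=3)
first = next(items)  # "a"
rest = list(items)   # continues where next() left off
```

The sketch uses return rather than the raise StopIteration from the question's generator: since Python 3.7 (PEP 479), a StopIteration raised inside a generator body is turned into a RuntimeError.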

Alternatively… are you sure you want Page to be an iterator, rather than a reusable, non-iterator iterable?

class Page:

    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.json_data = json_data  # keep the payload so __iter__ can reach it
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]

    def __iter__(self):
        # A fresh iterator on every call, so the page can be looped over repeatedly.
        return iter(datum for datum in self.json_data["data"])

Now, you don't need to copy the data into a list unless you want to do list-y things like indexing it in random order:

page = Page(data, …)
for thing in page:
    print(thing)
for thing in page:
    print(thing)

As a side note, this is repetitive:

iter(datum for datum in json_data["data"])

That (datum for datum in json_data["data"]) is just the same thing as json_data["data"], wrapped in a generator expression. Since a generator expression is already an iterator, you don't need iter() around it and can just return it:

return (datum for datum in json_data["data"])

Or, even simpler, you can just return an iterator over the data:

return iter(json_data["data"])

And if you actually want list-y sequence behavior, you can even make it a full-fledged Sequence pretty easily:

class Page:

    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.json_data = json_data  # keep the payload for __len__ and __getitem__
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]

    def __len__(self):
        return len(self.json_data["data"])

    def __getitem__(self, index):
        # Accepts both integers and slices, because the underlying list does.
        return self.json_data["data"][index]

And now:

page = Page(data, …)
for thing in page:
    print(thing)
for thing in reversed(page):
    print(thing)
for thing in page[-6:-2]:
    print(thing)
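The same reusable-iterable idea can be carried over to the multi-page type from the question. What follows is only a rough sketch, assuming a hypothetical fetch_page(url) callable that returns one page's parsed JSON in the shape the Page class reads ("data" plus "pages" / "next_url"). PagedResults and the fake pages are made up for illustration and are not part of any real client. Each for loop gets a fresh generator, items fetched so far are cached, and a new page is requested only when the cache runs out.

```python
class PagedResults:
    """Reusable iterable that fetches pages lazily and caches their items."""

    def __init__(self, fetch_page, first_url):
        self._fetch_page = fetch_page  # hypothetical: url -> parsed JSON dict
        self._next_url = first_url     # None once the last page has been fetched
        self._cache = []               # every item fetched so far, in order

    def __iter__(self):
        index = 0
        while True:
            if index < len(self._cache):
                yield self._cache[index]
                index += 1
            elif self._next_url is not None:
                # Cache exhausted: request the next page and remember its items.
                page = self._fetch_page(self._next_url)
                self._cache.extend(page["data"])
                self._next_url = page["pages"]["next_url"]
            else:
                return


# Fake two-page API, for demonstration only.
FAKE_PAGES = {
    "page1": {"data": ["some", "pieces"], "pages": {"next_url": "page2"}},
    "page2": {"data": ["of", "data"], "pages": {"next_url": None}},
}
requested = []


def fake_fetch(url):
    requested.append(url)
    return FAKE_PAGES[url]


results = PagedResults(fake_fetch, "page1")
first_pass = list(results)
second_pass = list(results)  # served entirely from the cache
```

Because the cache lives on the object and each generator keeps its own index, a second loop never triggers another request, and a loop that stops early leaves the remaining pages unfetched until something asks for them. The sketch ignores error handling and thread safety.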
