Reputation: 2049
I am building an API client in Python for an API that returns data in the following layout when the results span multiple pages:
{
    "data": ["some", "pieces", "of", "data"],
    "results_per_page": 2500,
    "total_results": 10000,
    "next_url": "http://fullyqualifiedurl.com/results_after=5000",
    "previous_url": "http://fullyqualifiedurl.com/results_after=2500"
}
I want to have an iterator which a client can call like this:
>>> results = client.results()
>>> result_count = 0
>>> for result in results:
...     result_count += 1
...
>>> print(result_count)
10000
The iterator should silently request the next page of data as it reaches the end of its current page.
I have developed something which yields pages, but on subsequent calls, I want to not have to re-fetch the data. Here is what I have:
class Iterator:
    def __init__(self, current_page, max_results=None):
        self.current_page = current_page
        self.max_results = max_results
        self.yielded_count = 0

    def _iter_items(self):
        for page in self._iter_page():
            for item in page:
                # Early exit from the page if we have set a limit.
                # (Use return, not raise StopIteration: inside a generator,
                # raising StopIteration becomes a RuntimeError on Python 3.7+.)
                if self._limit_reached():
                    return
                self.yielded_count += 1
                yield item

    def __iter__(self):
        return self._iter_items()

    def __next__(self):
        return next(self._iter_items())

    def _iter_page(self):
        while self.current_page is not None:
            yield self.current_page
            if self._has_next_page():
                self.current_page = self._get_next_page()
            else:
                self.current_page = None

    def _get_next_page(self):
        if self.current_page.next_page_url:
            return self.api_request(self.current_page.next_page_url)
        else:
            return None

    def _keep_iterating(self):
        return (
            self.current_page is not None
            and self.max_results
            and self.yielded_count >= self.max_results
        )

    def _limit_reached(self):
        return self.max_results and self.yielded_count >= self.max_results
class Page:
    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]
        self._data_iterator = iter(datum for datum in json_data["data"])

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._data_iterator)
        return item
What's happening right now is that I can successfully iterate over it once, but on the second iteration the iterator is empty. I would like it to cache the results on the first pass and allow subsequent iterations. Am I going about this the entirely wrong way? I feel like there should be an established pattern for this, but I can't really find anything.
Upvotes: 0
Views: 138
Reputation: 365707
I'm not sure whether you're talking about the Page type or the Iterator type here, because they're both iterators, they both have the same issues, and you've only given us a vague description of what you're doing with whichever one you're doing it with. But all of the following applies just as well to either of them (except for one note), so I'll talk about Page, because it's the simpler one.
An iterator can only be used once. That's inherent in what it means to be an iterator.
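A quick way to see this for yourself, using a plain list iterator as a stand-in for Page (the values are made up for illustration):

```python
# A stand-in for any iterator: iter() over a list of made-up values.
items = iter(["some", "pieces", "of", "data"])

first_pass = list(items)   # consumes the iterator
second_pass = list(items)  # nothing is left to consume

print(first_pass)   # ['some', 'pieces', 'of', 'data']
print(second_pass)  # []
```

That empty second pass is the same symptom you're describing.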
You can use itertools.tee to split off a second iterator, which caches the values from the first one.
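As a rough sketch of what that looks like (again with a made-up list iterator standing in for your Page):

```python
from itertools import tee

source = iter(["some", "pieces", "of", "data"])  # made-up values

# tee() returns independent iterators that share a buffer of values
# pulled from the source. Don't advance `source` directly after this,
# or the tee'd iterators will miss those values.
first, second = tee(source)

first_pass = list(first)
second_pass = list(second)

print(first_pass)   # ['some', 'pieces', 'of', 'data']
print(second_pass)  # ['some', 'pieces', 'of', 'data']
```

Note that when one branch is consumed completely before the other starts, tee ends up buffering every value anyway, so it buys you nothing over a plain list in that case.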
But if your goal is to iterate over the same values over and over, there's a much simpler solution: just copy the iterator into a sequence, like a list or tuple. Then you can iterate that as many times as you want.
page = list(Page(data, …))
for thing in page:
    print(thing)
for thing in page:
    print(thing)
While we're at it, your Iterator is not a valid iterator:
    def __iter__(self):
        return self._iter_items()

    def __next__(self):
        return next(self._iter_items())
An iterator must return self from __iter__, the way your Page does. Python doesn't enforce that rule, so if you get this wrong, you often end up with something that seems to work in one test, but then does the wrong thing somewhere else.
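Here's a simplified sketch of the problem and one way to fix it. Broken and Fixed are hypothetical stand-ins, not your actual classes: Broken copies your __iter__/__next__ pattern over a plain list, and Fixed creates the underlying iterator once and delegates to it.

```python
class Broken:
    """Mimics the question's pattern: __next__ builds a new generator per call."""

    def __init__(self, data):
        self.data = data

    def _iter_items(self):
        yield from self.data

    def __iter__(self):
        return self._iter_items()

    def __next__(self):
        # A brand-new generator every time, so this restarts at the first item.
        return next(self._iter_items())


class Fixed:
    """Creates the underlying iterator once and delegates to it."""

    def __init__(self, data):
        self._items = iter(data)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._items)


broken = Broken([1, 2, 3])
broken_calls = [next(broken), next(broken), next(broken)]

fixed = Fixed([1, 2, 3])
fixed_calls = [next(fixed), next(fixed), next(fixed)]

print(broken_calls)  # [1, 1, 1]
print(fixed_calls)   # [1, 2, 3]
```

In your real class the outcome depends on the state shared through current_page, so it may not fail in exactly this way, which is the kind of works-in-one-test behavior I mean.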
Alternatively… are you sure you want Page to be an iterator, rather than a reusable, non-iterator iterable?
class Page:
    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]
        # Keep the payload around so __iter__ can build a fresh iterator each time.
        self.json_data = json_data

    def __iter__(self):
        return iter(datum for datum in self.json_data["data"])
Now, you don't need to copy the data into a list unless you want to do list-y things like indexing it in random order:
page = Page(data, …)
for thing in page:
    print(thing)
for thing in page:
    print(thing)
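If you want to check which kind of object you have, the ABCs in collections.abc can tell you. This is a small sketch with a hypothetical stripped-down ReusablePage, not your full class:

```python
from collections.abc import Iterable, Iterator


class ReusablePage:
    """Hypothetical stripped-down Page: an iterable, but not an iterator."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        return iter(self._data)  # a fresh iterator on every call


page = ReusablePage(["some", "pieces"])

is_iterable = isinstance(page, Iterable)        # has __iter__
is_iterator = isinstance(page, Iterator)        # would also need __next__
fresh_each_time = iter(page) is not iter(page)  # two distinct iterators
passes = [list(page), list(page)]

print(is_iterable, is_iterator, fresh_each_time)  # True False True
print(passes)  # [['some', 'pieces'], ['some', 'pieces']]
```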
As a side note, this is repetitive:
iter(datum for datum in json_data["data"])
That (datum for datum in json_data["data"]) yields exactly the same items as json_data["data"], just wrapped in a generator expression. Since a generator expression is already an iterator, you can just return it:
return (datum for datum in json_data["data"])
Or, even simpler, you can just return an iterator over the data:
return iter(json_data["data"])
And if you actually want list-y sequence behavior, you can even make it a full-fledged Sequence pretty easily:
from collections.abc import Sequence

class Page(Sequence):
    def __init__(self, json_data, *args, **kwargs):
        self.client = kwargs.get("client")
        self.next_page_url = json_data["pages"]["next_url"]
        self.previous_page_url = json_data["pages"]["previous_url"]
        self.total_count = json_data["total_count"]
        self.json_data = json_data  # keep the payload for __len__/__getitem__

    def __len__(self):
        return len(self.json_data["data"])

    def __getitem__(self, index):
        return self.json_data["data"][index]
And now:
page = Page(data, …)
for thing in page:
    print(thing)
for thing in reversed(page):
    print(thing)
for thing in page[-6:-2]:
    print(thing)
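Putting the same idea to work one level up, here is a minimal sketch of a reusable iterable over all pages that caches items as it fetches them. Everything in it is an assumption for illustration: Results, fetch, fake_fetch and FAKE_PAGES are hypothetical names, and I'm assuming next_url sits at the top level of the payload as in your sample JSON (your Page code reads it from json_data["pages"] instead, so adjust accordingly).

```python
class Results:
    """Re-iterable view over paginated results that caches fetched items."""

    def __init__(self, first_url, fetch):
        # `fetch` is a hypothetical callable: URL -> decoded JSON dict with
        # "data" and "next_url" keys.
        self._fetch = fetch
        self._next_url = first_url
        self._cache = []

    def __iter__(self):
        # Replay whatever is cached, then fetch further pages on demand.
        index = 0
        while True:
            while index < len(self._cache):
                yield self._cache[index]
                index += 1
            if self._next_url is None:
                return
            page = self._fetch(self._next_url)
            self._cache.extend(page["data"])
            self._next_url = page.get("next_url")


# Fake two-page API so the sketch runs without a network.
FAKE_PAGES = {
    "page1": {"data": ["a", "b"], "next_url": "page2"},
    "page2": {"data": ["c"], "next_url": None},
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return FAKE_PAGES[url]


results = Results("page1", fake_fetch)
first = list(results)
second = list(results)

print(first)   # ['a', 'b', 'c']
print(second)  # ['a', 'b', 'c']
print(calls)   # ['page1', 'page2']
```

Because __iter__ is a generator function, every for loop gets its own fresh iterator, while the cache and the next URL live on the Results object, so each page is requested only once. The trade-off is that all results end up held in memory.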
Upvotes: 2