Coder123

Reputation: 344

How to dump all results of an API request when there is a page limit?

I am using an API to pull data from a URL; however, the API has a pagination limit.

I have a script that can get the results of a single page (or a per_page batch), but I want to automate it. I want to loop through all the pages (or per_page=500 batches) and load everything into a JSON file.

Here is my code, which gets 500 results per page:

import json, pprint
import requests

url = "https://my_api.com/v1/users?per_page=500"
header = {"Authorization": "Bearer <my_api_token>"}

s = requests.Session()
s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>" }

resp = s.get(url, headers=header, verify=False)
raw = resp.json()
for x in raw:
    print(x)

The output is 500 results, but is there a way to keep going and pull results starting from where it left off? Or even go page by page and get all the data until a page comes back empty?

Upvotes: 0

Views: 6323

Answers (1)

Skyler

Reputation: 75

It would be helpful if you presented a sample response from your API.


If the API is designed properly, there will be a next property in each response that leads you to the next page.

You can then keep calling the API recursively with the link given in next. On the last page, there will be no next entry in the Link header.

resp.links["next"]["url"] will give you the URL to the next page.

For example, the GitHub API has next, last, first, and prev properties.
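
For instance, building on the resp and s from your original snippet, a quick check might look like this (assuming your API populates the Link header the way GitHub does, which you would need to confirm):

# resp.links is requests' parsed view of the Link header, e.g.
# {"next": {"url": "...&page=2", "rel": "next"}, "last": {"url": "...", "rel": "last"}}
next_url = resp.links["next"]["url"] if "next" in resp.links else None
if next_url:
    # Fetch the next page with the same session and headers
    resp = s.get(next_url, headers=header, verify=False)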

To put it into code, first you need to turn your code into functions.

Since there is a maximum of 500 results per page, you are presumably extracting a list of records of some sort from the API. Often, these records are returned in a list somewhere inside raw.

For now, let's assume you want to extract all elements inside a list at raw.get('data').

import requests

header = {"Authorization": "Bearer <my_api_token>"}

results_per_page = 500


def compose_url():
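    # Always start from the first page; later pages come from resp.links["next"]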
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + "1"
    )


def get_result(url=None):
    if url is None:
        url_get = compose_url()
    else:
        url_get = url
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    resp = s.get(url_get, headers=header, verify=False)

    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)

    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list

    if "next" not in resp.links:
        # We are at the last page, return the data
        return data

    # Otherwise, recursively get results from the next url
    return data + get_result(resp.links["next"]["url"])  # concat lists


def main():
    # Driver function
    data = get_result()
    # Then you can print the data or save it to a file


if __name__ == "__main__":
    # Now run the driver function
    main()

However, if there isn't a proper Link header, I see two solutions: (1) recursion and (2) a loop.

I'll demonstrate recursion.

As you have mentioned, when API responses are paginated, i.e. when there is a maximum number of results per page, there is often a query parameter such as a page number or a start index to indicate which "page" you are querying. We'll use a page_number parameter in the code below (the exact name depends on your API).

The logic is:

  • Given an HTTP response, if it contains fewer than 500 results, there are no more pages. Return the results.
  • If a response contains exactly 500 results, there is probably another page, so advance page_number by 1, recurse (call the function again), and concatenate the result with the current page's results.

import requests

header = {"Authorization": "Bearer <my_api_token>"}

results_per_page = 500


def compose_url(results_per_page, current_page_number):
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + str(current_page_number)
    )


def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)

    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)

    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list

    # If the page contains fewer than results_per_page (500) results,
    # there are no more pages
    if len(data) < results_per_page:
        return data

    # Otherwise, advance the page number and do a recursion
    return data + get_result(current_page_number + 1)  # concat lists


def main():
    # Driver function
    data = get_result(1)
    # Then you can print the data or save it to a file


if __name__ == "__main__":
    # Now run the driver function
    main()
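
Since you want to load everything into a JSON file: once get_result() returns the combined list, you can write it out with json.dump. A minimal sketch of a main() that does this (the filename users.json is just an example):

import json


def main():
    # Driver function: collect all pages starting from page 1
    data = get_result(1)
    # Write the combined list of results to a JSON file
    with open("users.json", "w") as f:
        json.dump(data, f, indent=2)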

If you truly want to store the raw responses, you can. However, you'll still need to check the number of results in each response. The logic is similar: if a given raw contains 500 results, there is probably another page, so we advance the page number by 1 and recurse.

Let's still assume raw.get('data') is the list whose length is the number of results.

Because JSON objects/dictionaries cannot simply be concatenated, you can store each page's raw (which is a dictionary) in a list of raws. You can then parse and combine the data however you want; see the sketch after the function below.

Use the following get_result function:

def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)

    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)

    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list

    if len(data) == results_per_page:
        return [raw] + get_result(current_page_number + 1)  # concat lists

    return [raw]  # wrap raw in a list so pages can be concatenated
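
With that version, parsing the stored raws back into a single flat list of results (still assuming each page keeps its records under raw["data"]) could look like this:

raws = get_result(1)  # a list of per-page response dicts
all_data = [item for raw in raws for item in raw.get("data", [])]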

As for the loop method, the logic is similar to the recursion. Essentially, you fetch page after page in a loop, collect the results, and break out as soon as a page contains fewer than 500 results (see the sketch below).

If you know the total number of results in advance, you can simply run the loop a predetermined number of times.
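
For completeness, here is a minimal sketch of the loop variant as a hypothetical get_all_results() helper, reusing compose_url, header, and results_per_page from above (the page_number parameter is still an assumption about your API):

def get_all_results():
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    all_data = []
    page_number = 1
    while True:
        url = compose_url(results_per_page, page_number)
        resp = s.get(url, headers=header, verify=False)
        if resp.status_code != 200:
            raise Exception(resp.status_code)
        data = resp.json().get("data")
        all_data += data
        # A page with fewer than results_per_page results is the last one
        if len(data) < results_per_page:
            break
        page_number += 1
    return all_data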


Do you follow? Do you have any further questions?

(I'm a little confused by what you mean by "load it into a JSON file". Do you mean saving the final results into a JSON file? Or are you referring to the .json() method in resp.json()? In the latter case, you don't need import json to call resp.json(); the .json() method on resp is part of the requests library.)

As a bonus, you can make your HTTP requests asynchronous, but that is slightly beyond the scope of your original question.


P.S. I'm happy to learn about other, perhaps more elegant, solutions that people use.

Upvotes: 2
