Reputation: 23
I'm currently ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I have some code that roughly functions as follows:
import urllib2
import simplejson

while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # DO SOMETHING HERE
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is really slow, and I was wondering if there is any way to loop through the pages asynchronously. I have been recommended to take a look at Twisted, but I am not entirely sure how to proceed.
Upvotes: 2
Views: 3715
Reputation: 3742
The problem here is that you do not know up front what to read next until you call the API. So think about it this way: what can you do in parallel?
I do not know how much of this you can do in parallel, or which parts, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or rate limits
- the processing of one page/batch can be done independently of the others
What is slow is the I/O, so you can immediately split your code into two tasks running in parallel: one that reads data, puts it into a queue, and keeps reading until it hits the limit or an empty response (pausing while the queue is full);
and a second task that takes data from the queue and does something with it.
That way you can call one task from the other (see the sketch right after this).
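A minimal sketch of that queue-based split, using plain threads and the standard library Queue; start_url, the queue size and the processing body are placeholders, not something from the question:

import threading
import Queue  # the module is called `queue` in Python 3

import requests

start_url = 'http://example.com/api/documents'  # placeholder
page_queue = Queue.Queue(maxsize=10)            # reader pauses while the queue is full


def read_pages(api_url):
    # producer: follow the pagination links and push each page onto the queue
    while api_url:
        data = requests.get(api_url).json()
        page_queue.put(data)            # blocks if the queue is full
        api_url = data.get('more_url')  # missing 'more_url' ends the loop
    page_queue.put(None)                # sentinel: no more pages


def process_pages():
    # consumer: take pages off the queue and do something with them
    while True:
        data = page_queue.get()
        if data is None:
            break
        for item in data['documents']:
            pass  # DO SOMETHING HERE


reader = threading.Thread(target=read_pages, args=(start_url,))
worker = threading.Thread(target=process_pages)
reader.start()
worker.start()
reader.join()
worker.join()

Note that this only overlaps the downloads with the processing inside one process; it does not give you more CPU for the processing itself.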
The other approach is to have one task call the other immediately after the data is read, so that their execution runs in parallel, just slightly shifted.
How would I implement it? As Celery tasks, and yes, with requests. For example, the second approach:
import requests
from celery import task  # with a newer Celery, use app.task / shared_task instead


@task
def do_data_process(data):
    # do something with the data
    pass


@task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        # schedule the next page before processing this one
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or hand it to a worker and do it in another process
    # do_data_process.delay(data)
It is up to you how many tasks run in parallel if you add limits to your code; you can even have workers on multiple machines and separate queues for parse_one_page and do_data_process (a sketch of how to start this follows).
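For illustration, assuming the two tasks above live in a module called tasks.py and a broker is already configured (the module name, queue names and URL are placeholders, not something from the question):

# in your Celery configuration: route each task to its own queue,
# so dedicated workers (possibly on different machines) can consume them
CELERY_ROUTES = {
    'tasks.parse_one_page': {'queue': 'fetch'},
    'tasks.do_data_process': {'queue': 'process'},
}

# kick off the chain once; the tasks schedule the remaining pages themselves
from tasks import parse_one_page
parse_one_page.delay('http://example.com/api/documents')

Routing do_data_process to its own queue only matters if you call it with .delay(); a direct call runs inside the same worker process as parse_one_page.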
Why this approach, and not Twisted or async I/O?
Because you have CPU-bound data processing (parsing the JSON, then working on the data), and for that it is better to have separate processes, which Celery handles perfectly.
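A sketch of running the workers, assuming the tasks.py module and the queue names from the sketch above; the concurrency values are arbitrary, and each worker is a pool of separate prefork processes:

celery -A tasks worker -Q fetch --concurrency=2
celery -A tasks worker -Q process --concurrency=4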
Upvotes: 2